Abstract

This  paper  presents  a  simple  two-step  nonparametric  estimator  for  a  triangular 
simultaneous  equation  model.     Our  approach  employs  series  approximations  that  exploit  the 
additive  structure  of  the  model.     The  first  step  comprises  the  nonparametric  estimation 
of  the  reduced  form  and  the  corresponding  residuals.     The  second  step  is  the  estimation 
of  the  primary  equation  via  nonparametric  regression  with  the  reduced  form  residuals 
included  as  a  regressor.     We  derive  consistency  and  asymptotic  normality  results  for  our 
estimator,   including  optimal  convergence  rates.     Finally  we  present  an  empirical  example, 
based  on  the  relationship  between  the  hourly  wage  rate  and  annual  hours  worked,  which 
illustrates  the  utility  of  our  approach. 

Keywords:     Nonparametric  Estimation,   Simultaneous  Equations,  Series  Estimation,  Two-Step 
Estimators.

1.        Introduction

Structural  estimation  is  important  in  econometrics,   because  it  is  needed  to  account 
correctly  for  endogeneity  that  comes  from  individual  choice  or  market  equilibrium.     Often 
structural  models  do  not  have  tight  functional  form  specifications,   so  that  it  is  useful 
to  consider  nonparametric  structural  models  and  their  estimation.     This  paper  proposes 
and analyzes one approach to nonparametric structural modeling and estimation. This
approach differs from standard nonparametric regression, because the object of
estimation is the structural model and not just a conditional expectation.

Nonparametric  structural  models  have  been  previously  considered  in  Roehrig  (1988), 
Newey  and  Powell  (1989)  and  Vella  (1991).     Roehrig  (1988)  gives  identification  results 
for  a  system  of  equations  when  the  errors  are  independent  of  the  instruments.     Newey  and 
Powell  (1989)  consider  identification  and  estimation  under  the  weaker  condition  that  the 
disturbance  has  conditional  mean  zero  given  the  instruments.     The  results  of  this  paper 
are  complementary  to  these  previous  results.     The  model  we  consider  is  more  restrictive 
in  some  ways  than  Newey  and  Powell  (1989),  but  is  easier  to  estimate. 

The model we consider is a triangular nonparametric simultaneous equations model
where

(1.1)  $y = g_0(x, z_1) + \varepsilon, \quad x = \Pi_0(z) + u, \quad E[\varepsilon \mid u, z] = E[\varepsilon \mid u], \quad E[u \mid z] = 0,$

$x$ is a $d_x \times 1$ vector of endogenous variables, $z$ is a $d_z \times 1$ vector of instrumental
variables that includes $z_1$ as a $d_1 \times 1$ subvector, $\Pi_0(z)$ is a $d_x \times 1$ vector of
functions of the instruments $z$, and $u$ is a $d_x \times 1$ vector of disturbances. Equation
(1.1) generalizes the limited information simultaneous equations model to allow for a
nonparametric relationship $g_0(x, z_1)$ between the variables $y$ and $(x, z_1)$ and a
nonparametric reduced form $\Pi_0(z)$.

The conditional mean restriction in equation (1.1) is one conditional mean version
of the usual orthogonality condition for a linear model.² Equation (1.1) is a more
general assumption than requiring that $(\varepsilon, u)$ be independent of $z$ and that $E[u] = 0$.
This added generality may be important for some econometric models, because it allows for
conditional heteroskedasticity in the disturbances. For example, in some separable
demand models $y$ can be purchases of a commodity and $x$ expenditures on a subgroup of
commodities. Endogeneity results from expenditure on a commodity subset being a choice
variable. Also, heteroskedasticity often results from individual heterogeneity in demand
functions, as pointed out by Brown and Walker (1989). Assumption (1.1) allows for both
endogeneity and heteroskedasticity.

An alternative model, considered by Newey and Powell (1989), requires only that
$E[\varepsilon \mid z] = 0$. Strictly speaking, neither model is stronger than the other, because
equation (1.1) does not imply that $E[\varepsilon \mid z] = 0$.³ The additive separability of $x$
into a reduced form and a residual $u$ satisfying equation (1.1) is a strong restriction.
One benefit of such a condition is that it leads to a straightforward estimation
approach, as will be further discussed below. In this sense, the model of equation (1.1)
is a convenient one for applications where its restrictions are palatable. In contrast,
estimation is very difficult if only the conditional mean assumption $E[\varepsilon \mid z] = 0$ is
satisfied, as discussed in Newey and Powell (1989).⁴

We propose a two-step nonparametric estimator of this model. The first step
consists of the construction of a residual vector $\hat{u}$ from the nonparametric regression
of $x$ on $z$. The second step is the nonparametric regression of $y$ on $x$, $z_1$, and
$\hat{u}$. Two-step nonparametric kernel estimation has previously been considered in Ahn
(1994). Our results concern series estimators, which are convenient for imposing the
additive structure implied by equation (1.1). We derive optimal mean-square convergence
rates and asymptotic normality results that account for the first step estimation and
allow for $\sqrt{n}$-consistency of mean-square continuous functionals of the estimator.

Section 2 of the paper considers identification, presenting straightforward
sufficient conditions. Section 3 describes the estimator and Section 4 derives
convergence rates. Section 5 gives conditions for asymptotic normality and inference,
including consistent standard error formulae. Section 6 describes an extension of the
model and estimators to semiparametric models. Section 7 contains an empirical example
and Monte Carlo results illustrating the utility of our approach.

² If $\Pi_0(z)$ were linear in $z$ and the conditional expectations were replaced by
population regressions, then the third condition implies that $z$ is orthogonal to $u$,
and the second condition that $z$ is orthogonal to $\varepsilon$.

³ If equation (1.1) is satisfied, $u$ is independent of $z$, and $E[\varepsilon] = 0$, then
$E[\varepsilon \mid z] = E[E[\varepsilon \mid u, z] \mid z] = E[E[\varepsilon \mid u] \mid z] = E[E[\varepsilon \mid u]] = E[\varepsilon] = 0$.

⁴ The mapping from the reduced form $E[y \mid z]$ to the structure $g_0(x, z_1)$ turns out to be
discontinuous, making it difficult to construct a consistent estimator.


2.        Identification


An implication of our model is that for $\lambda_0(u) = E[\varepsilon \mid u]$,

(2.1)  $E[y \mid x, z] = g_0(x, z_1) + E[\varepsilon \mid x, z] = g_0(x, z_1) + E[\varepsilon \mid u, z] = g_0(x, z_1) + \lambda_0(u) = h_0(w), \quad w = (x', z_1', u')'.$

Thus, the function of interest, $g_0(x, z_1)$, is one component of an additive regression of
$y$ on $(x, z_1)$ and $u$. Equation (2.1) also implies that
$E[\varepsilon \mid u, z] = E[y - g_0(x, z_1) \mid u, z] = E[\lambda_0(u) + \{y - E[y \mid x, z]\} \mid u, z] = \lambda_0(u)$
only depends on $u$, which implies $E[\varepsilon \mid u, z] = E[\varepsilon \mid u]$,
as specified in equation (1.1). Thus, the additive structure of equation (2.1) is
equivalent to the conditional mean restriction of equation (1.1). Furthermore, $u$ is
identified, so that the identification of $g_0$ under equation (1.1) is the same as the
identification of this additive component in equation (2.1).

To analyze identification of $g_0(x, z_1)$, note that a conditional expectation is
unique with probability one, so that any other additive function $g(x, z_1) + \lambda(u)$
satisfying equation (2.1) must have $\mathrm{Prob}(g(x, z_1) + \lambda(u) = g_0(x, z_1) + \lambda_0(u)) = 1$. Therefore,
identification is equivalent to equality of conditional expectations implying equality of
the additive components, up to a constant. Equivalently, working with the difference of
two conditional expectations, identification is equivalent to the statement that a zero
additive function must have only constant components. To be precise we have

Theorem 2.1: $g_0(x, z_1)$ is identified, up to an additive constant, if and only if
$\mathrm{Prob}(\delta(x, z_1) + \gamma(u) = 0) = 1$ implies there is a constant $c_g$ with
$\mathrm{Prob}(\delta(x, z_1) = c_g) = 1$.


There is a straightforward interpretation of this result that leads to a simple
sufficient condition for identification. Suppose that identification fails. Then by
Theorem 2.1, there are $\delta(x, z_1)$ and $\gamma(u)$ with $\delta(x, z_1) + \gamma(u) = 0$ and $\delta(x, z_1)$
nonconstant. Intuitively, this implies a functional relationship between $u$ and
$(x, z_1)$, a degeneracy in the joint distribution of these two random variables. For
instance, if $\gamma(u)$ were a one-to-one function of a subvector $u_1$ of $u$, then
$u_1 = \gamma^{-1}(-\delta(x, z_1))$. More generally, $\delta(x, z_1) + \gamma(u) = 0$ implies an exact relationship between
the random vectors $(x, z_1)$ and $u$. Consequently, the absence of an exact relationship
will imply identification.

To  formalize  these  ideas  it  is  helpful  to  be  precise  about  what  we  mean  by  existence 
of  a  functional  relationship. 

Definition: There is a functional relationship between $(x, z_1)$ and $u$ if and only if
there exists $h(x, z_1, u)$ and a set $\mathcal{U}$ such that $\mathrm{Prob}(\mathcal{U}) > 0$, $\mathrm{Prob}(h(x, z_1, u) = 0) = 1$,
and $\mathrm{Prob}(h(x, z_1, u) = 0) < 1$ for all fixed $u \in \mathcal{U}$.


This condition says that the pair of vectors $(x, z_1)$ and $u$ solve a nontrivial implicit
equation. Thus, the functional relationship of this definition is an implicit one. As
previously noted, nonidentification implies existence of such an implicit relationship.
Therefore, the contrapositive statement leads to the following sufficiency result for
identification.

Theorem 2.2: If there is no functional relationship between $(x, z_1)$ and $u$ then
$g_0(x, z_1)$ is identified up to an additive constant.


Although it is a sufficient condition, nonexistence of a functional relationship between
$(x, z_1)$ and $u$ is not a necessary condition for identification. By Theorem 2.1, it is
nonexistence of an additive functional relationship that is the necessary and sufficient
condition. Thus, identification may still occur when there is an exact, nonadditive
functional relationship.

The additive structure in equation (2.1) is so strong that the model may be
identified even when the usual order condition is not satisfied, i.e. even though $z$ has
smaller dimension than $(x, z_1)$. For example, suppose $x$ is two dimensional, $z_1$ is not
present, $z$ is a scalar, $\Pi(z) = (z, e^z)$, and $g_0(x)$ and $\lambda_0(u)$ are restricted to be
differentiable. In this example there is only one instrument although there are two
endogenous variables. The reduced form $x = \Pi(z) + u$ implies a nonlinear, nonadditive
relationship between $x$ and $u$ of the form $x_2 - u_2 = \exp(x_1 - u_1)$. Consider an additive
function $\delta(x) + \gamma(u)$ satisfying

(2.2)  $\delta(x) + \gamma(u) = \delta(x_1, \exp(x_1 - u_1) + u_2) + \gamma(u_1, u_2) = 0.$

The restriction that $g_0(x)$ and $\lambda_0(u)$ are differentiable means that $\delta(x)$ and $\gamma(u)$
must also be differentiable. Let numerical subscripts denote partial derivatives with
respect to corresponding arguments. Then equation (2.2) implies

$\delta_1(x) + \delta_2(x)\exp(x_1 - u_1) = 0, \quad -\delta_2(x)\exp(x_1 - u_1) + \gamma_1(u) = 0, \quad \delta_2(x) + \gamma_2(u) = 0.$

Combining the last two equations gives

$\gamma_2(u) = -\gamma_1(u)\exp(u_1 - x_1).$

Assuming that the support of $x_1$ conditional on $u$ contains more than one point with
probability one, so that $x_1$ can vary for given $u$, it follows from this equation that
with probability one $\gamma_1(u) = \gamma_2(u) = 0$. Then the second and first equations imply
$\delta_1(x) = \delta_2(x) = 0$, implying that both $\delta$ and $\gamma$ are constant functions. Thus,
equation (2.2) implies that $\delta(x)$ is a constant function, and hence $g_0(x)$ is
identified, up to a constant, by Theorem 2.1.

There is a simple sufficient condition for identification that generalizes the rank
condition for identification in a linear simultaneous equations system. Suppose that $z$
is partitioned as $z = (z_1, z_2)$. Let $x$, $u$, $1$, and $2$ subscripts denote
differentiation with respect to $x$, $u$, $z_1$, and $z_2$ respectively.

Theorem 2.3: If $g(x, z_1)$, $\lambda(u)$, and $\Pi(z)$ are differentiable, the boundary of the
support of $(z, u)$ has zero probability, and with probability one $\mathrm{rank}(\Pi_2(z)) = d_x$,
then $g_0(x, z_1)$ is identified.

If $\Pi(z)$ were linear in $z$ then $\mathrm{rank}(\Pi_2(z)) = d_x$ is precisely the condition for
identification of one equation of a linear simultaneous system, in terms of the reduced
form coefficients. Thus, this condition is a nonlinear, nonparametric generalization of
the rank condition.

It is interesting to note, as pointed out by a referee, that this condition leads to
an explicit formula for the structure. Note that $h(x, z) = E[y \mid x, z] = g(x, z_1) +
\lambda(x - \Pi(z))$. Differentiating gives

$h_x(x, z) = g_x(x, z_1) + \lambda_u(u), \quad h_1(x, z) = g_1(x, z_1) - \Pi_1(z)'\lambda_u(u), \quad h_2(x, z) = -\Pi_2(z)'\lambda_u(u).$

Then, if $\mathrm{rank}(\Pi_2(z)) = d_x$ for almost all $z$, multiplying $h_2(x, z)$ by
$D(z) = [\Pi_2(z)\Pi_2(z)']^{-1}\Pi_2(z)$ and solving gives $\lambda_u(u) = -D(z)h_2(x, z)$. Plugging this result into
the other terms gives

(2.3)  $g_1(x, z_1) = h_1(x, z) - \Pi_1(z)'D(z)h_2(x, z), \quad g_x(x, z_1) = h_x(x, z) + D(z)h_2(x, z).$

This formula gives an explicit equation for the derivatives of the structural function $g$
in terms of the identified conditional expectations $\Pi(z)$ and $h(x, z)$.
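
As a numerical sanity check of equation (2.3), the following Python sketch builds a toy
scalar model, forms $h(x, z) = g(x, z_1) + \lambda(x - \Pi(z))$, and recovers the derivatives of
$g$ using only derivatives of $h$ and $\Pi$. The particular functions `g`, `lam`, and `Pi`,
the evaluation point, and the use of central finite differences are illustrative
assumptions and are not taken from the paper.

```python
import math

# Hypothetical toy primitives (assumptions for illustration only).
def g(x, z1):
    return math.sin(x) + z1 ** 2

def lam(u):
    return 0.5 * u + u ** 3

def Pi(z1, z2):
    # Reduced form with dPi/dz2 = 1 + 1.5 * z2**2 > 0, so the rank condition holds.
    return z1 + z2 + 0.5 * z2 ** 3

def h(x, z1, z2):
    # h(x, z) = E[y | x, z] = g(x, z1) + lam(x - Pi(z))
    return g(x, z1) + lam(x - Pi(z1, z2))

def central_diff(f, t, step=1e-5):
    return (f(t + step) - f(t - step)) / (2.0 * step)

def structural_derivatives(x, z1, z2):
    h_x = central_diff(lambda t: h(t, z1, z2), x)
    h_1 = central_diff(lambda t: h(x, t, z2), z1)
    h_2 = central_diff(lambda t: h(x, z1, t), z2)
    Pi_1 = central_diff(lambda t: Pi(t, z2), z1)
    Pi_2 = central_diff(lambda t: Pi(z1, t), z2)
    # Scalar version of D(z) = [Pi_2 Pi_2']^{-1} Pi_2.
    D = Pi_2 / (Pi_2 * Pi_2)
    g_x = h_x + D * h_2          # second part of (2.3)
    g_1 = h_1 - Pi_1 * D * h_2   # first part of (2.3)
    return g_x, g_1

g_x, g_1 = structural_derivatives(0.3, -0.2, 0.7)
```

In this toy model the recovered values can be compared with the analytic derivatives
$\cos(x)$ and $2 z_1$ of the assumed `g`.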

So far we have only considered identification of $g_0(x, z_1)$ up to an additive
constant. This is sufficient for many purposes, such as comparing $g_0(x, z_1)$ at
different values of $(x, z_1)$. In other cases, such as forecasting demand quantity, it is
also desirable to know the level of $g_0(x, z_1)$, which can be identified if a location
restriction is imposed on the distribution of $\varepsilon$. For example, suppose that it is
assumed that $E[\varepsilon] = 0$, so that $E[y] = E[g_0(x, z_1)]$. Then from equation (2.1) it
follows that if $\tau(u)$ is some function with $\int \tau(u)\,du = 1$,

(2.4)  $\int E[y \mid x, z_1, u]\tau(u)\,du - E\left[\int E[y \mid x, z_1, u]\tau(u)\,du\right] + E[y] = g_0(x, z_1) - E[g_0(x, z_1)] + E[y] = g_0(x, z_1).$

Thus, under the familiar restriction $E[\varepsilon] = 0$ the conditions for identification of
$g_0(x, z_1)$ up to a constant, as in equation (2.2), also serve to identify the level of
$g_0(x, z_1)$.

Another approach to identifying the constant term is to place restrictions on
$\lambda_0(u)$. For example, the restriction $\lambda_0(0) = 0$ is a generalization of a condition that
holds in the linear simultaneous equations model because, in the linear model where
$g_0(x, z_1)$ and $\Pi_0(z)$ are linear and equation (1.1) holds with population regressions
replacing conditional expectations, the regression of $y$ on $(x, z_1)$ and $u$ will be
linear in $u$. More generally, we will consider imposing the restriction $\lambda_0(\bar{u}) = \bar{\lambda}$ for
some known $\bar{u}$ and $\bar{\lambda}$. In this case

(2.5)  $E[y \mid x, z_1, \bar{u}] - \bar{\lambda} = g_0(x, z_1) + \lambda_0(\bar{u}) - \bar{\lambda} = g_0(x, z_1).$

Thus, under the restriction $\lambda_0(\bar{u}) = \bar{\lambda}$, there is a simple, explicit formula for the
structural function in terms of $h$. The constant will be identified under other types of
restrictions as well, but this will suffice for many purposes.


3.        Estimation 

Equation (2.1) can be estimated in two steps by combining estimation of the residual
$u$ with estimation of the additive regression in equation (2.1). The first step of this
procedure is formation of nonparametrically estimated residuals $\hat{u}_i = x_i - \hat{\Pi}(z_i)$,
$(i = 1, \ldots, n)$. The second step is estimation of the additive regression in equation
(2.1), using the estimated residuals $\hat{u}_i$ in place of the true ones. An estimator of
the structural function $g_0(x, z_1)$ can then be recovered by pulling out the component
that depends on those variables.

Estimation  of  additive  regression  models  has  previously  been  considered  in  the 
literature,  and  we  can  apply  those  results  for  our  second  step  estimation.     There  are  at 
least  two  general  approaches  to  estimation  of  an  additive  component  of  a  nonparametric 
regression.     One  is  to  use  an  estimator  that  imposes  additivity  and  then  pull  out  the 
component  of  interest.     This  approach  for  kernel  estimators  has  been  considered  by 
Breiman  and  Friedman  (1985),   Hastie  and  Tibshirani  (1991),  and  others.     Series  estimators 
are  particularly  convenient  for  imposing  additivity,  by  simply  excluding  interaction 
terms,  as  in  Stone  (1985)  and  Andrews  and  Whang  (1990). 

Another approach to estimating the additive component $g_0(x, z_1)$ of the conditional
expectation is to use the fact that, for a function $\tau(u)$ such that $\int \tau(u)\,du = 1$,

$\int E[y \mid x, z_1, u]\tau(u)\,du = g_0(x, z_1) + \int \lambda_0(u)\tau(u)\,du.$

Then $g_0(x, z_1)$ can be estimated by integrating, over $u$, an unrestricted estimator of
the conditional expectation. This partial means approach to estimating additive models
has been proposed by Newey (1994), Tjostheim and Auestad (1994), and Linton and Nielsen
(1995). It is computationally convenient for kernel estimators, although it is less
asymptotically efficient than imposing additivity for estimation of $\sqrt{n}$-consistent
functionals of additive models when the disturbance is homoskedastic.
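
The partial means identity can be illustrated numerically. The Python sketch below
assumes a known additive function `h0(x, u) = g0(x) + lam0(u)` with hypothetical choices
`g0 = sin` and `lam0(u) = u**2`, takes $\tau$ to be the uniform density on $[-1, 1]$, and
integrates by a midpoint rule. In an application `h0` would be replaced by an
unrestricted estimate of the conditional expectation; everything here is an assumption
made for illustration.

```python
import numpy as np

def g0(x):
    return np.sin(x)      # hypothetical structural function

def lam0(u):
    return u ** 2         # hypothetical control function E[eps | u]

def h0(x, u):
    return g0(x) + lam0(u)

def partial_mean(x, n_grid=2000):
    """Midpoint-rule value of the integral of h0(x, u) * tau(u) du,
    where tau is the uniform density 1/2 on [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, n_grid + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    return float(np.sum(h0(x, mid) * 0.5) * width)

# Differences of partial means across x recover differences of g0,
# because the integral of lam0 * tau does not depend on x.
diff = partial_mean(0.4) - partial_mean(0.1)
```

Here the integral of `lam0` against the assumed $\tau$ equals $1/3$, so `partial_mean(x)`
differs from `g0(x)` only by that constant.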

In this paper we focus on series estimators because of their computational
convenience and high efficiency in imposing additivity. To describe the two-step series
estimator we initially consider the first step. For each positive integer $L$ let
$r^L(z) = (r_{1L}(z), \ldots, r_{LL}(z))'$ be a vector of approximating functions. In this paper we will
consider in detail polynomial and spline approximating functions, which will be further
discussed below. Let an $i$ subscript index the observations, and let $n$ be the total
number of observations. Let $\hat{\Pi}(z)$ be the predicted value from a regression of $x_i$ on
$r_i = r^L(z_i)$,

(3.1)  $\hat{\Pi}(z) = r^L(z)'\hat{\gamma}, \quad \hat{\gamma} = (R'R)^{-1}R'(x_1, \ldots, x_n)', \quad R = [r_1, \ldots, r_n]'.$
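
The first step in equation (3.1) can be sketched in a few lines of Python. The example
assumes a scalar instrument, a scalar endogenous variable, a power series basis
$1, z, \ldots, z^{L-1}$, and simulated data with reduced form $\exp(z)$; the function names
`power_basis` and `first_step` and the data generating process are hypothetical.

```python
import numpy as np

def power_basis(z, L):
    """Rows are r^L(z_i)' = (1, z_i, ..., z_i^(L-1))."""
    return np.vander(z, N=L, increasing=True)

def first_step(x, z, L):
    """Least-squares regression of x on r^L(z); returns gamma_hat and residuals."""
    R = power_basis(z, L)
    gamma_hat, *_ = np.linalg.lstsq(R, x, rcond=None)
    Pi_hat = R @ gamma_hat          # fitted reduced form at the sample points
    return gamma_hat, x - Pi_hat    # u_hat_i = x_i - Pi_hat(z_i)

# Simulated data (assumption): x = exp(z) + u with E[u | z] = 0.
rng = np.random.default_rng(0)
n = 500
z = rng.uniform(-1.0, 1.0, n)
u = 0.3 * rng.standard_normal(n)
x = np.exp(z) + u

gamma_hat, u_hat = first_step(x, z, L=5)
```

Using `numpy.linalg.lstsq` rather than forming $(R'R)^{-1}$ explicitly gives the same
coefficients when $R$ has full column rank and is numerically more stable.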

To form the second step, let $p^K(w) = (p_{1K}(w), \ldots, p_{KK}(w))'$ be a vector of approximating
functions of $w = (x', z_1', u')'$ such that each $p_{kK}(w)$ depends either on $(x, z_1)$ or on
$u$, but not both. Exclusion of the interaction terms will mean that any linear
combination of the approximating functions has the same additive structure as in equation
(2.1), i.e. that additivity is imposed. Also, let $1(A)$ denote the indicator function
for the event $A$, and $\tau(w)$ denote a function of the form

(3.2)  $\tau(w) = \prod_{j=1}^{d+d_x} 1(a_j \le w_j \le b_j),$

where $a_j$ and $b_j$ are finite constants, $w_j$ is the $j$th component of $w$, and
$d = d_x + d_1$ is the dimension of $(x, z_1)$. This $\tau(w)$ is a trimming function to be further
discussed below. Let $\hat{u}_i = x_i - \hat{\Pi}(z_i)$, $\hat{w}_i = (x_i', z_{1i}', \hat{u}_i')'$, where an $i$ subscript for
$w$ refers to the observation, and $\hat{\tau}_i = \tau(\hat{w}_i)$. The second step is obtained by regressing
$y_i$ on $\hat{p}_i = p^K(\hat{w}_i)$ for each observation where $\hat{\tau}_i = 1$, giving

(3.3)  $\hat{h}(w) = p^K(w)'\hat{\beta}, \quad \hat{\beta} = (\hat{P}'\hat{P})^{-1}\hat{P}'Y, \quad \hat{P} = [\hat{\tau}_1\hat{p}_1, \ldots, \hat{\tau}_n\hat{p}_n]', \quad Y = (y_1, \ldots, y_n)'.$


The trimming function $\tau(w)$ is convenient for the asymptotic theory. It can also
be used to exclude large values of $w$ that might lead to outliers in the case of
polynomial regression. Although we assume that the trimming function is nonrandom, some
results would extend to the case where the limits $a_j$ and $b_j$ are estimated. In
particular, if these limits are estimated from independent data (e.g. a subsample that is
not otherwise used in estimation), then the results described here will still hold.
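
The two steps, including the trimming, can be put together in a short simulation. The
sketch below is only illustrative: it assumes scalar $x$ and $z$, no $z_1$, a cubic first
step, the additive second-step basis $(1, x, x^2, \hat{u})$, fixed symmetric trimming limits,
and a data generating process in which $E[\varepsilon \mid u, z] = u$. None of these choices, nor the
names `series_fit` and `two_step`, come from the paper.

```python
import numpy as np

def series_fit(basis, target):
    """Least-squares coefficients of target on the columns of basis."""
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return coef

def two_step(y, x, z, trim_x=2.5, trim_u=1.5):
    # First step: cubic power series in z, then residuals u_hat.
    R = np.vander(z, N=4, increasing=True)
    u_hat = x - R @ series_fit(R, x)
    # Trimming indicator: product of 1(a_j <= w_j <= b_j) over the components of w_hat.
    tau = (np.abs(x) <= trim_x) & (np.abs(u_hat) <= trim_u)
    # Second step: additive basis in x and u_hat, with no interaction terms,
    # fitted on the observations with tau = 1.
    P = np.column_stack([np.ones_like(x), x, x ** 2, u_hat])
    beta_hat = series_fit(P[tau], y[tau])
    return beta_hat, u_hat, tau

# Simulated triangular model (assumption): g0(x) = 1 + x - 0.5 x^2, lambda0(u) = u.
rng = np.random.default_rng(1)
n = 2000
z = rng.uniform(-1.0, 1.0, n)
u = 0.5 * rng.standard_normal(n)
x = 2.0 * z + u
eps = u + 0.1 * rng.standard_normal(n)
y = 1.0 + x - 0.5 * x ** 2 + eps

beta_hat, u_hat, tau = two_step(y, x, z)

# Regression that ignores endogeneity, for comparison.
naive = series_fit(np.column_stack([np.ones(n), x, x ** 2]), y)
```

In this design the coefficient on $x$ from the regression that omits $\hat{u}$ is pushed away
from its true value of one, while the two-step coefficient is not, which is the role
played by the residual regressor.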

The estimator $\hat{h}(w)$ can be used to construct an estimator of the structural
function $g_0(x, z_1)$ and of $\lambda_0(u)$ by collecting those terms that depend only on $(x, z_1)$
and those that depend only on $u$, respectively. Suppose that $p_{1K}(w) = 1$ is the
constant, that $p_{kK}(w)$ depends only on $(x, z_1)$ for the next $K_g$ terms, and that the
remaining terms depend only on $u$. Then the estimators can be formed as

(3.4)  $\hat{g}(x, z_1) = \hat{c}_g + \sum_{j=2}^{K_g+1} \hat{\beta}_j p_j(x, z_1), \quad \hat{\lambda}(u) = \hat{c}_\lambda + \sum_{j=K_g+2}^{K} \hat{\beta}_j p_j(u), \quad \hat{c}_g + \hat{c}_\lambda = \hat{\beta}_1.$

These estimators are uniquely defined except for the constant terms $\hat{c}_g$ and $\hat{c}_\lambda$.

To complete the definition of this estimator the constant terms should be specified.
For many objects of estimation, such as predicting how $g(x, z_1)$ changes as $x$ or $z_1$
shifts, the constants are not important. For example, if $g(x, z_1)$ represents log-demand
and $x$ includes the log of price, a constant is not needed for estimation of
elasticities (i.e. derivatives of $g$). In other cases the constant is important. For
example, knowing the level of demand is important for predicting tax receipts.

Because of the trimming we cannot use the most familiar condition that $E[\varepsilon] = 0$ to
estimate the constants. However, the condition that $\lambda_0(\bar{u}) = \bar{\lambda}$, which generalizes the
linear model, can easily be used to estimate the constants. Choosing
$\hat{c}_\lambda = \bar{\lambda} - \sum_{j=K_g+2}^{K} \hat{\beta}_j p_j(\bar{u})$ and $\hat{c}_g = \hat{\beta}_1 - \hat{c}_\lambda$ leads to an estimator satisfying
$\hat{\lambda}(\bar{u}) = \bar{\lambda}$ and equation (3.4). In this case it will follow that the estimators of the
individual components are

(3.6)  $\hat{g}(x, z_1) = \hat{h}(x, z_1, \bar{u}) - \bar{\lambda}, \quad \hat{\lambda}(u) = \hat{h}(x, z_1, u) - \hat{h}(x, z_1, \bar{u}) + \bar{\lambda}.$
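
The choice of constants and the resulting identities in equation (3.6) can be checked
directly. The Python sketch below assumes a small additive basis with one constant, two
terms in a scalar $x$, and two terms in a scalar $u$; the coefficient vector `beta` and the
values `u_bar` and `lam_bar` are arbitrary illustrative numbers rather than estimates.

```python
import numpy as np

def p_g(x):
    return np.array([x, x ** 2])      # basis terms depending only on x

def p_lam(u):
    return np.array([u, u ** 2])      # basis terms depending only on u

def split_components(beta, u_bar, lam_bar):
    """Split h_hat into g_hat and lam_hat with the normalization lam_hat(u_bar) = lam_bar."""
    beta_1, beta_g, beta_lam = beta[0], beta[1:3], beta[3:5]
    c_lam = lam_bar - beta_lam @ p_lam(u_bar)
    c_g = beta_1 - c_lam              # so that c_g + c_lam = beta_1, as in (3.4)

    def h_hat(x, u):
        return beta_1 + beta_g @ p_g(x) + beta_lam @ p_lam(u)

    def g_hat(x):
        return c_g + beta_g @ p_g(x)

    def lam_hat(u):
        return c_lam + beta_lam @ p_lam(u)

    return g_hat, lam_hat, h_hat

beta = np.array([1.0, 2.0, -0.5, 0.8, 0.3])   # hypothetical second-step coefficients
u_bar, lam_bar = 0.5, 0.0
g_hat, lam_hat, h_hat = split_components(beta, u_bar, lam_bar)
```

With these definitions `g_hat(x)` coincides with `h_hat(x, u_bar) - lam_bar` for any `x`,
which is the first identity in (3.6).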

We consider two specific types of series estimators corresponding to polynomial and
spline approximating functions. To describe these let $\mu = (\mu_1, \ldots, \mu_{d_z})'$ denote a
vector of nonnegative integers, i.e. a multi-index, with norm $|\mu| = \sum_{j=1}^{d_z} \mu_j$, and let
$z^\mu = \prod_{j=1}^{d_z} (z_j)^{\mu_j}$. For a sequence $(\mu(\ell))_{\ell=1}^{\infty}$ of distinct such vectors, a power series
approximation for the first stage is given by

$r^L(z) = (z^{\mu(1)}, \ldots, z^{\mu(L)})'.$

To describe a power series in the second stage, let $(\mu(k))_{k=1}^{\infty}$ denote a sequence of
multi-indices with dimension the same as $w$, and such that for each $k$, $w^{\mu(k)}$ depends
only on $(x, z_1)$ or $u$, but not both. This feature can be incorporated by making sure
that the first $d$ components of $\mu(k)$ are zero if any of the last $d_x$ components are
nonzero, and vice versa. Then for the second stage a power series approximation can be
obtained as

$p^K(w) = (w^{\mu(1)}, \ldots, w^{\mu(K)})'.$

For the asymptotic theory it will be assumed that each multi-index sequence is ordered
with degree $|\mu|$ increasing in $\ell$ or $k$, and that all terms of a particular order are
included before increasing the order.
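
One way to generate a multi-index sequence with these properties is sketched below in
Python. The function enumerates all multi-indices for $w$ up to a maximum total degree,
drops those that mix the first $d$ components with the last $d_x$ components, and orders
the rest by degree. The function names and the tie-breaking rule within a degree are
assumptions made for illustration.

```python
import math
from itertools import product

def additive_multi_indices(d, d_x, max_degree):
    """Multi-indices for w = (x, z_1, u) with no (x, z_1)-by-u interactions.

    d is the dimension of (x, z_1) and d_x is the dimension of u.
    """
    indices = []
    for mu in product(range(max_degree + 1), repeat=d + d_x):
        if sum(mu) > max_degree:
            continue
        uses_xz = any(mu[:d])      # some power on an (x, z_1) component
        uses_u = any(mu[d:])       # some power on a u component
        if uses_xz and uses_u:
            continue               # interaction term: excluded to impose additivity
        indices.append(mu)
    # Order by total degree so every term of one order precedes the next order.
    indices.sort(key=lambda mu: (sum(mu), mu))
    return indices

def power_terms(w, indices):
    """Evaluate w^mu = prod_j w_j^mu_j for each multi-index."""
    return [math.prod(wj ** m for wj, m in zip(w, mu)) for mu in indices]

indices = additive_multi_indices(d=1, d_x=1, max_degree=2)
terms = power_terms((2.0, 3.0), indices)
```

With one component in each block and maximum degree two, the cross term with
multi-index `(1, 1)` is the only one removed.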

The theory to follow uses orthonormal polynomials, which may also have computational
advantages. For the first step, replacing each power $z^\mu$ by the product of univariate
polynomials of the same order, where the univariate polynomials are orthonormal with
respect to some distribution, may lead to reduced collinearity. The estimator will be
numerically invariant to such a replacement, because $|\mu(\ell)|$ is monotonically
increasing. An analogous replacement could also be used in the second step.

Regression splines are smooth piecewise polynomials with fixed knots (join points).
They have been considered in the statistics literature by Agarwal and Studden (1980).
They differ from smoothing splines: for regression splines the smoothing
parameter is the number of knots. They have attractive features relative to power series
in that they are less sensitive to outliers and to bad approximations over small regions.
The theory here requires that the knots be placed in the support of $z_i$, which therefore
must be known, and that the knots be evenly spaced. It should be possible to generalize
the results to allow the boundary of the support to be estimated and the knots not to be
evenly spaced, but these generalizations lead to further technicalities that we leave to
future research.

To describe regression splines it is convenient to assume that $a_j = -1$, $b_j = 1$,
and that the support of $z$ is a cartesian product of the interval $[-1, 1]$. For a scalar
$c$ let $(c)_+ = 1(c > 0) \cdot c$. An $m$th degree spline with $J-1$ evenly spaced knots on
$[-1, 1]$ is a linear combination of

$p_{\ell J}(u) = \begin{cases} u^{\ell-1}, & 1 \le \ell \le m+1, \\ \{[u + 1 - 2(\ell - m - 1)/J]_+\}^m, & m+2 \le \ell \le m+J. \end{cases}$
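
The univariate spline basis above translates directly into code. The following Python
sketch evaluates the $m + J$ basis functions at a vector of points in $[-1, 1]$; the function
name `spline_basis` is hypothetical, and the sketch assumes $m \ge 1$.

```python
import numpy as np

def spline_basis(u, m, J):
    """Columns are p_{lJ}(u) for l = 1, ..., m + J (truncated power basis).

    Assumes m >= 1. The J - 1 knots sit at -1 + 2 * k / J for k = 1, ..., J - 1.
    """
    u = np.asarray(u, dtype=float)
    # Polynomial part: u^(l-1) for l = 1, ..., m + 1.
    cols = [u ** (l - 1) for l in range(1, m + 2)]
    # Truncated power part: ([u + 1 - 2 (l - m - 1) / J]_+)^m for l = m + 2, ..., m + J.
    for l in range(m + 2, m + J + 1):
        shifted = u + 1.0 - 2.0 * (l - m - 1) / J
        cols.append(np.where(shifted > 0, shifted, 0.0) ** m)
    return np.column_stack(cols)

# Cubic spline (m = 3) with J - 1 = 3 interior knots at -0.5, 0, and 0.5.
B = spline_basis(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]), m=3, J=4)
```

Each truncated power column is zero to the left of its knot, so all of them vanish at
$u = -1$.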

In this paper we take $m$ fixed but will allow $J$ to increase with sample size. For a
set of multi-indices $\{\mu(\ell)\}$, with $\mu_j(\ell) \le m+J$ for each $j$ and $\ell$, the approximating
functions for $z$ will be products of univariate splines. In particular, if the number
of knots for the $j$th component of $z$ is $J_j$, then the approximating functions could
be formed as

(3.7)  $r_{\ell L}(z) = \prod_{j=1}^{d_z} p_{\mu_j(\ell), J_j}(z_j), \quad (\ell = 1, \ldots, L).$

Throughout the paper it will be assumed that $J_j$ depends on $K$ in such a way that the
ratio of numbers of knots for each pair of elements of $z$ is bounded above and away from
zero. Spline approximating functions in the second step could be formed in the analogous
way, from products of univariate splines, with additivity imposed by only including terms
that depend only on one of $(x, z_1)$ or $u$.

The theory to follow uses B-splines, which are linear transformations of the above
functions that have lower multicollinearity. The low multicollinearity of B-splines and
the recursive formula for their calculation also lead to computational advantages; e.g. see
Powell (1981).


4.        Convergence  Rates 


In this Section we derive mean-square error and uniform convergence rates. To
obtain results it is important to impose some regularity conditions. Let $X = (x, z)$,
$\eta = y - h_0(w)$, and $u = x - \Pi_0(z)$. Also, for a matrix $D$ let $\|D\| = [\mathrm{trace}(D'D)]^{1/2}$,
for a random matrix $Y$ let $\|Y\|_v = \{E[\|Y\|^v]\}^{1/v}$, $v < \infty$, and let $\|Y\|_\infty$ be the infimum of
constants $C$ such that $\mathrm{Prob}(\|Y\| < C) = 1$.


Assumption 1: $\{(y_i, x_i, z_i)\}$, $(i = 1, 2, \ldots)$ is i.i.d. and $\mathrm{Var}(x \mid z)$ and $\mathrm{Var}(y \mid X)$
are bounded.


The  bounded  second  conditional  moment  assumption  is  quite  common  in  the  series  estimation 
literature  (e.g.  Stone,   1985).     Relaxing  it  would  lead  to  complications  that  we  wish  to 
avoid. 

Next, we impose the following requirement. Let $\mathcal{W} = \{w : \tau(w) = 1\}$.

Assumption 2: $z$ is continuously distributed with density that is bounded away from zero
on its support, and the support of $z$ is a cartesian product of compact, connected
intervals. Also, $w$ is continuously distributed and the density of $w$ is bounded away
from zero on $\mathcal{W}$, and $\mathcal{W}$ is contained in the interior of the support of $w$.

This assumption is useful for deriving the properties of series estimators like those
considered here (e.g. see Newey, 1997). It allows us to bound below the eigenvalues of
the second moment matrix of the approximating functions. Also, an identification
condition like those discussed in Section 2 is embodied in this assumption. The density
of $w$ being bounded away from zero means that there is no functional relationship
between $(x, z_1)$ and $u$, so that by Theorem 2.2 identification holds. This assumption,
along with smoothness conditions discussed below, implies that the dimension of $z$ must be
at least as large as the dimension of $(x, z_1)$, a familiar order condition for
identification.

To see why Assumption 2 implies the order condition, note that the density of $w$
is bounded away from zero on an open set. Then, because $w$ is a one-to-one function of
$x$, $z_1$, and $\Pi_0(z)$, the density of $(z_1, \Pi_0(z))$ must be bounded away from zero on an open
set. But $(z_1, \Pi_0(z))$ is a vector function of $z$ that will be smooth under assumptions
given below, so its range must be confined to a manifold of smaller dimension than
$(z_1, \Pi_0(z))$ unless $z$ has at least as many components as $(z_1, \Pi_0(z))$. Since the dimension
of $\Pi_0(z)$ and $x$ are the same, $z$ must have at least as many components as $(x, z_1)$.

Some discreteness in the components of $z$ can be allowed for without affecting the
results. In particular, Assumption 2 can be weakened to hold only for some component of
the distribution of $z$. Also, one could allow some components of $z$ to be discretely
distributed, as long as they have finite support. On the other hand, it is not easy to
extend our results to the case where there are discrete instruments that take on an
infinite number of values.

Some  smoothness  conditions  are  useful  for  controlling  the  bias  of  the  series 
estimators. 

Assumption 3: $\Pi_0(z)$ is continuously differentiable of order $s_1$ on the support of $z$
and $g_0(x, z_1)$ and $\lambda_0(u)$ are Lipschitz and continuously differentiable of order $s$ on
$W$.

The derivative orders $s_1$ and $s$ control the rate at which polynomials or splines
approximate the function. In particular, the rate of approximation for $g_0(x, z_1)$ and
$\lambda_0(u)$, and hence also for $h_0(w)$, will be $O(K^{-s/d})$, where $d$ is the dimension of
$(x, z_1)$.

The last regularity condition restricts the rate of growth of the number of terms $K$
and $L$. Recall that $d_1$ denotes the dimension of $z$.

Assumption 4: Either a) for power series, $(K^3 + K^2 L)[(L/n)^{1/2} + L^{-s_1/d_1}] \to 0$; or b)
for splines, $(K^2 + K L^{1/2})[(L/n)^{1/2} + L^{-s_1/d_1}] \to 0$.


For example, for splines, if $K$ and $L$ grow at the same rate then this condition
requires that they grow slower than $n^{1/5}$, and that $s_1 > 2d_1$.

As a preliminary step it is helpful to derive convergence rates for the conditional
expectation estimator $\hat{h}(w)$. Let $F_0$ denote the distribution function of $w$.

Lemma 4.1: If Assumptions 1-4 are satisfied then

$$\int \tau(w)[\hat{h}(w) - h_0(w)]^2 \, dF_0(w) = O_p(K/n + K^{-2s/d} + L/n + L^{-2s_1/d_1}).$$

Also, for $q = 1/2$ for splines, and $q = 1$ for power series,

$$\sup_{w \in W} |\hat{h}(w) - h_0(w)| = O_p(K^q[(K/n)^{1/2} + K^{-s/d} + (L/n)^{1/2} + L^{-s_1/d_1}]).$$

This  and  all  subsequent  proofs  are  given  in  the  Appendix. 

This result leads to convergence rates for the structural estimator $\hat{g}(x, z_1)$ given
in equation (3.4). We will treat the constant term a little differently for the mean
square and uniform convergence rates, so it is helpful to state and discuss them
separately. For mean square convergence we give a result for the de-meaned difference
between the estimator and the truth. Let $\bar{\tau} = E[\tau(w)]$.

Theorem 4.2: If Assumptions 1-4 are satisfied then for $\Delta(x, z_1) = \hat{g}(x, z_1) - g_0(x, z_1)$,

$$\int \tau(w)\Big[\Delta(x, z_1) - \int \tau(w)\Delta(x, z_1) F_0(dw)/\bar{\tau}\Big]^2 F_0(dw) = O_p(K/n + K^{-2s/d} + L/n + L^{-2s_1/d_1}).$$

The convergence rate is the sum of two terms, depending on the number $L$ of
approximating functions in the first step and the number $K$ in the second step. Each of
these two terms has a form analogous to that derived in Stone (1985), which would attain the
optimal mean-square convergence rate for estimating a conditional expectation if $K$ and
$L$ were chosen appropriately. In particular, if $K$ is chosen proportional to $n^{d/(d+2s)}$
and $L$ proportional to $n^{d_1/(d_1+2s_1)}$, and Assumption 4 is satisfied, then the
conclusion of Theorem 4.2 is

(4.1) $\int \tau(w)\Big[\Delta(x, z_1) - \int \tau(w)\Delta(x, z_1) F_0(dw)/\bar{\tau}\Big]^2 F_0(dw) = O_p(\max\{n^{-2s/(d+2s)}, n^{-2s_1/(d_1+2s_1)}\})$.

From Stone (1982) it follows that the convergence rates in this expression are optimal
for their respective steps, i.e. $n^{-2s/(d+2s)}$ would be optimal for estimation of
$g_0(x, z_1)$ if $u$ did not have to be estimated in the first step. Thus, when $K$ and $L$
are chosen appropriately, the mean square error convergence rate of the second step
estimator is the slower of the optimal rate for the first step and the optimal rate for
the second step that would obtain if the first step did not have to be estimated.
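
The balancing behind these choices of $K$ and $L$ can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the function names `mse_bound` and `balanced_terms` are hypothetical, all constants of proportionality are set to one, and the smoothness and dimension values are arbitrary illustrations.

```python
def mse_bound(n, K, L, s, d, s1, d1):
    """Order of the mean-square bound K/n + K^(-2s/d) + L/n + L^(-2s1/d1)."""
    return K / n + K ** (-2.0 * s / d) + L / n + L ** (-2.0 * s1 / d1)


def balanced_terms(n, s, d, s1, d1):
    """Real-valued K ~ n^(d/(d+2s)) and L ~ n^(d1/(d1+2s1)), without rounding."""
    K = n ** (d / (d + 2.0 * s))
    L = n ** (d1 / (d1 + 2.0 * s1))
    return K, L


# Illustrative (assumed) values for the sample size, smoothness, and dimensions.
n, s, d, s1, d1 = 10_000, 4.0, 2, 6.0, 2
K, L = balanced_terms(n, s, d, s1, d1)

# With these choices the variance term and the squared-bias term of each step
# are of the same order, so the bound is driven by the slower of the two rates.
second_step_rate = n ** (-2.0 * s / (d + 2.0 * s))
first_step_rate = n ** (-2.0 * s1 / (d1 + 2.0 * s1))
bound = mse_bound(n, K, L, s, d, s1, d1)
```

Here `K / n` and `K ** (-2 * s / d)` both equal `second_step_rate`, which is the sense in which the choice of $K$ balances variance against squared bias; the same holds for $L$ and `first_step_rate`.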

An important feature of this result is that the second step convergence rate
$n^{-2s/(d+2s)}$ is the optimal one for a function of a $d$-dimensional variable rather than
the slower rate that would be optimal for estimation of a general function of $w$, where
additivity was not imposed. This occurs because the additivity of $h_0(w)$ is used in
estimation. It is essentially the exclusion of interaction terms that leads to this
result, as discussed in Andrews and Whang (1990).

Although a full derivation of bounds on the rate of convergence of estimators of
$g_0(x, z_1)$ is beyond the scope of this paper, something can be said. It is known from
Stone (1982) that $n^{-2s/(d+2s)}$ is the best attainable mean-square error (MSE) rate for
estimation of $g_0(x, z_1)$ when $u$ is known. Since the best rate when $u$ is unknown
cannot be better than the best rate when $u$ is known, $\hat{g}$ must attain the best rate when
its rate is $n^{-2s/(d+2s)}$. Therefore, for the ranges of $d_1$, $s_1$, $d$, $s$ where this rate is
attained we know that it is the best one (and that the estimator has the best rate).

Setting $K$ and $L$ to be proportional to the rates $n^{d/(d+2s)}$ and $n^{d_1/(d_1+2s_1)}$ that
minimize each component of the mean-square error it follows that the optimal second stage
rate will be attained for splines when $s/d \le s_1/d_1$ and $s/d > 2$. For power series the
side conditions of Assumption 4 are even stronger and would narrow the range of $s/d$ and
$s_1/d_1$ where the estimator could attain the best convergence rate for the second step.

The condition $s/d \le s_1/d_1$ means that the first stage is at least as smooth
(relative to its dimension) as the second stage. We do not know what the optimal
convergence rate would be when the first stage is less smooth than the second stage,
although in that case we know that the first stage convergence rate would dominate the
mean-square error of our estimator if $K$ and $L$ are chosen to minimize the mean-square
error in Lemma 4.1 or Theorem 4.2. Our estimator could conceivably be suboptimal in that case.
Also, the side conditions in Assumption 4 are quite strong and rule out any possibility
of optimality of our estimator when neither the first nor the second stage is very smooth
(e.g. when $s/d \le 1$).

It is interesting to note that the convergence rates in equation (4.1) are only
relevant for the function itself and not for derivatives. Because $\hat{u}$ is plugged into
$\hat{h}(w)$, one might think that convergence rates for derivatives are also important, as in a
Taylor expansion of $\hat{h}(\hat{u})$ around $\hat{h}(u)$. The reason that they are not is that $u$
depends only on the conditioning variables $x$ and $z$ in equation (2.1), a feature that
allows us to work with a Taylor expansion of $h_0(w)$ rather than $\hat{h}(w)$ in the proofs.

We thank a referee for pointing this out and for giving a slightly more general version
of the following discussion of optimality.

Theorem 4.2 gives convergence rates that are net of the constant term. It would also
be interesting to know if similar results hold when the constant term is included. That
will depend on the identifying assumption for the constant term. With the trimming the
usual restriction $E[\varepsilon] = 0$ does not identify the constant. Also, the restriction
$\lambda_0(\bar{u}) = \bar{\lambda}$ is a pointwise one, so it does not lead to mean-square error convergence
rates. It would be possible to obtain mean-square rates that include the constant term
by specifying a restriction $E[\tau(w)\varepsilon] = 0$, but this condition does not seem to be well
motivated in the model and so will not be considered.

It is straightforward to obtain uniform convergence rates for $\hat{g}(x, z_1)$ from the
uniform convergence rates for $\hat{h}(w)$ in Lemma 4.1. In particular, note that $\hat{g}(x, z_1)$ can
be recovered from $\hat{h}(w)$, up to the constant term, by just fixing $u$ at some value (in
$W$), so that uniform rates follow immediately. Furthermore, as long as the restriction
on $\lambda_0(u)$ used to identify the constant term is continuous in the supremum norm on $W$,
uniform rates for $\hat{g}$ will also follow from Lemma 4.1. In particular, when the
identification restriction $\lambda_0(\bar{u}) = \bar{\lambda}$ is imposed leading to an estimator as in equation
(3.6), uniform convergence rates for $\hat{g}$ follow immediately from those for $\hat{h}$.
Specifically, for $W_1$ equal to the coordinate projection of $W$ on values of $(x, z_1)$,

Theorem 4.3: If $\hat{g}(x, z_1) = \hat{h}(x, z_1, \bar{u}) - \bar{\lambda}$, $\lambda_0(\bar{u}) = \bar{\lambda}$, and Assumptions 1-4 are satisfied
then for $q = 1/2$ for splines, and $q = 1$ for power series,

$$\sup_{(x, z_1) \in W_1} |\hat{g}(x, z_1) - g_0(x, z_1)| = O_p(K^q[(K/n)^{1/2} + K^{-s/d} + (L/n)^{1/2} + L^{-s_1/d_1}]).$$

This uniform convergence rate cannot be made equal to Stone's (1982) bounds for either
the first or second steps, due to the leading $K^q$ term. Nevertheless, these uniform
rates improve on some in the literature, e.g. on Cox (1988), as noted in Newey (1997).
Apparently, it is not yet known whether series estimators can attain the optimal uniform
convergence rates.
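
The recovery of the structural function from the second-step regression, by fixing $u$ at a value $\bar{u}$ and subtracting $\bar{\lambda}$, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: $x$ and $z$ are scalar, no trimming is applied ($\tau \equiv 1$), both steps use low-order power series, and the data generating process, the function names `power_basis` and `two_step_series`, and the choices $\bar{u} = 0$, $\bar{\lambda} = 0$ are all hypothetical.

```python
import numpy as np


def power_basis(v, degree):
    """Return the columns v, v^2, ..., v^degree (no constant term)."""
    v = np.asarray(v, dtype=float)
    return np.column_stack([v ** k for k in range(1, degree + 1)])


def two_step_series(y, x, z, K_deg=3, L_deg=3):
    """Two-step series estimator for scalar x and z, without trimming."""
    n = len(y)
    # Step 1: series regression of x on z; keep the reduced form residuals.
    R = np.column_stack([np.ones(n), power_basis(z, L_deg)])
    pi_coef = np.linalg.lstsq(R, x, rcond=None)[0]
    u_hat = x - R @ pi_coef
    # Step 2: regress y on a constant and additive power series in x and u_hat.
    P = np.column_stack([np.ones(n), power_basis(x, K_deg), power_basis(u_hat, K_deg)])
    beta = np.linalg.lstsq(P, y, rcond=None)[0]

    def g_hat(x0, u_bar=0.0, lam_bar=0.0):
        """Evaluate h_hat(x0, u_bar) - lam_bar, i.e. fix u and remove the constant."""
        x0 = np.atleast_1d(np.asarray(x0, dtype=float))
        u0 = np.full(x0.shape, u_bar)
        p0 = np.column_stack([np.ones(x0.size), power_basis(x0, K_deg), power_basis(u0, K_deg)])
        return p0 @ beta - lam_bar

    return g_hat, u_hat


# Simulated triangular model (an assumed design, for illustration only).
rng = np.random.default_rng(0)
n = 5000
z = rng.uniform(-1.0, 1.0, n)
u = rng.normal(0.0, 0.5, n)
x = z + u                            # reduced form with Pi_0(z) = z
eps = u + rng.normal(0.0, 0.1, n)    # E[eps | u] = lambda_0(u) = u, so lambda_0(0) = 0
y = x + 0.5 * x ** 2 + eps           # structural function g_0(x) = x + 0.5 x^2

g_hat, u_hat = two_step_series(y, x, z)
grid = np.array([-0.5, 0.0, 0.5])
max_err = np.max(np.abs(g_hat(grid) - (grid + 0.5 * grid ** 2)))
```

In this design an ordinary regression of $y$ on functions of $x$ alone would be biased, because $x$ and the disturbance share the component $u$; including the residual series is what removes that dependence.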

The uniform convergence rate is slower, and the conditions on $K$ and $L$ are more
stringent, for power series than for splines. Nevertheless we
retain the results for polynomials because they have some appealing practical features.
They are often used for increased flexibility in functional form. Also, polynomials do
not require knot placement, and hence are simpler to use in practice.

We could also derive convergence rates for $\hat{\lambda}(u)$, and obtain convergence rates
identical to those in Theorems 4.2 and 4.3. We omit these results for brevity.


5.        Inference 

Large sample confidence intervals are useful for inference. We focus on pointwise
confidence intervals rather than uniform ones, as in much of the literature. We
consider all linear functionals of the function $h_0(w)$, which turn out to include the
value of the function $g_0(x, z_1)$ at a point. We construct confidence intervals using
standard errors that account for the nonparametric first step estimation.

To describe the standard errors and associated confidence intervals it is useful to
consider a general framework. For this purpose let $h$ denote a possible true function
$h_0(w)$, where the $w$ argument is suppressed for convenience. That is, the symbol $h$
represents the entire function. Let $a(h)$ be a particular number associated with $h$.
As a function of $h$, $a(h)$ represents a mapping from possible functions $h$ to the
real line, i.e. a functional. The object of interest will be assumed to be the value
$\theta_0 = a(h_0)$ of the functional at $h_0$. In this Section we develop standard errors for the
estimator $\hat{\theta} = a(\hat{h})$ of $\theta_0$ and give asymptotic normality results that allow formation
of large sample confidence intervals.

This framework is general enough to include many objects of interest, such as the
value of $g_0$ at a point. For example, under the restriction $\lambda_0(\bar{u}) = \bar{\lambda}$ discussed
earlier, the value of $g_0(x, z_1)$ at a point $(x, z_1)$ can be represented as

(5.1) $g_0(x, z_1) = a(h_0)$, $a(h) = h(x, z_1, \bar{u}) - \bar{\lambda}$.

A general inference procedure for linear functionals will apply to the functional of this
equation, i.e. can be used in the construction of large sample confidence intervals for
the value of $g_0$ at a point.

Another example is a weighted average derivative of $g_0(x, z_1)$. This object is
particularly interesting when $g_0$ is an index model that depends on $w_1 = (x, z_1)$ only
through a linear combination $w_1'\gamma$, say $g_0(w_1) = t(w_1'\gamma)$ where $t(q)$ is a
differentiable function with derivative $t_q(q)$. Then for any $v(w)$ such that
$v(w)t_q(w_1'\gamma)$ is integrable,

(5.2) $\int v(w)[\partial h_0(w)/\partial w_1] \, dw = \Big[\int t_q(w_1'\gamma)v(w) \, dw\Big]\gamma$,

so that the weighted average derivative is proportional to $\gamma$. This is also a linear
functional of $h_0$, where $a(h) = \int v(w)[\partial h(w)/\partial w_1] \, dw$. This functional may also be useful
for summarizing the slope of $g_0(w_1)$ over some range, as discussed in Section 7. Unlike
the value of $g_0(w_1)$ or $\partial g_0(w_1)/\partial w_1$ at a point, the weighted average derivative
functional will be $\sqrt{n}$-consistent under regularity conditions discussed below.
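
The proportionality of the weighted average derivative to the index coefficients can be verified numerically in a simple case. The sketch below is illustrative only: it drops the $u$ argument, takes a two-dimensional $w_1$, and assumes a link $t(q) = \tanh(q)$, a standard normal density as the weight $v$, and hypothetical coefficients `gamma`.

```python
import numpy as np

gamma = np.array([1.0, -2.0])            # hypothetical index coefficients
grid = np.linspace(-6.0, 6.0, 401)
W1, W2 = np.meshgrid(grid, grid, indexing="ij")
cell = (grid[1] - grid[0]) ** 2          # area element for the Riemann sums

index = gamma[0] * W1 + gamma[1] * W2
g = np.tanh(index)                       # index model g(w) = t(w'gamma), t = tanh
weight = np.exp(-0.5 * (W1 ** 2 + W2 ** 2)) / (2.0 * np.pi)  # weight function v(w)

# Numerical partial derivatives of g along each coordinate of w.
dg_dw1, dg_dw2 = np.gradient(g, grid, grid)

# Left side of the identity: the weighted average derivative, one entry per coordinate.
avg_deriv = np.array([np.sum(weight * dg_dw1), np.sum(weight * dg_dw2)]) * cell

# Right side: the scalar integral of t_q(w'gamma) v(w), which multiplies gamma.
scale = np.sum(weight * (1.0 - np.tanh(index) ** 2)) * cell
```

Up to discretization error, `avg_deriv` equals `scale * gamma`, so the ratio of its two entries reproduces the ratio of the index coefficients.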

We can derive standard errors for linear functionals of $h$, which is a class
general enough to include many important examples, such as the value of $g_0$ at a point or
a weighted average derivative. The estimator $\hat{\theta} = a(\hat{h})$ of $\theta_0 = a(h_0)$ is a natural,
"plug-in" estimator. For example, the value of $g_0$ at a point could be estimated by
applying the functional in equation (5.1) to obtain $\hat{g}(x, z_1) = a(\hat{h}) = \hat{h}(x, z_1, \bar{u}) - \bar{\lambda}$. In
general, linearity of $a(h)$ implies that

(5.3) $\hat{\theta} = A\hat{\beta}$, $A = (a(p_{1K}), \ldots, a(p_{KK}))$.

Because $\hat{\theta}$ is a linear combination of the two-step least squares coefficients $\hat{\beta}$, a
natural estimator of the variance can be obtained by applying the formula for parametric
two step estimators (see Newey 1984). As for other series estimators, such as in Newey
(1997), this should lead to correct inference asymptotically, because it accounts
properly for the variability of the estimator. Let $\otimes$ denote the Kronecker product and

(5.4) $\hat{Q} = \hat{P}'\hat{P}/n$, $\hat{\Sigma} = \sum_{i=1}^n \hat{\tau}_i \hat{p}_i \hat{p}_i'[y_i - \hat{h}(\hat{w}_i)]^2/n$, $\hat{Q}_1 = I \otimes (R'R/n)$,

$\hat{\Sigma}_1 = \sum_{i=1}^n (\hat{u}_i \hat{u}_i') \otimes (r_i r_i')/n$, $\hat{H} = \sum_{i=1}^n \hat{\tau}_i [\partial \hat{h}(\hat{w}_i)/\partial u]' \otimes \hat{p}_i r_i'/n$,

where $I$ is an identity matrix whose order is the dimension of $x$.

The variance estimate for $a(\hat{h})$ is then


(5.5) $\hat{V} = A\hat{Q}^{-1}[\hat{\Sigma} + \hat{H}\hat{Q}_1^{-1}\hat{\Sigma}_1\hat{Q}_1^{-1}\hat{H}']\hat{Q}^{-1}A'$.


This estimator is like that suggested by Newey (1984) for parametric two-step
regressions. It is equal to a term $A\hat{Q}^{-1}\hat{\Sigma}\hat{Q}^{-1}A'$, that is the standard heteroskedasticity
consistent variance estimator, plus an additional nonnegative definite term that accounts
for the presence of $\hat{u}$. It has this additive form, where the variance of the two-step
estimator exceeds the first term, because the second step conditions on the first step
dependent variables, making the second and first steps orthogonal.
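
The additive structure of this variance estimate can be made concrete with a short computation. The sketch below is a hedged illustration for scalar $x$, in which case the Kronecker products reduce to ordinary products; the function name `two_step_variance` and its argument layout are hypothetical, and the inputs (regressor matrices, residuals, derivative of the fitted function with respect to $u$) are assumed to have been computed beforehand.

```python
import numpy as np


def two_step_variance(A, P, R, resid, u_hat, dh_du, tau=None):
    """Sketch of a two-step variance estimate for scalar x.

    A      : (K,) vector of the functional applied to each approximating function
    P      : (n, K) second-step regressors evaluated at the estimated w_i
    R      : (n, L) first-step regressors
    resid  : (n,) second-step residuals y_i - h_hat(w_i)
    u_hat  : (n,) first-step residuals
    dh_du  : (n,) derivative of h_hat with respect to u at each observation
    tau    : (n,) 0/1 trimming indicators (defaults to no trimming)
    """
    n = P.shape[0]
    tau = np.ones(n) if tau is None else np.asarray(tau, dtype=float)
    Pt = P * tau[:, None]                       # trimmed rows; tau^2 = tau for 0/1 values
    Q = Pt.T @ Pt / n
    Sigma = (Pt * (resid ** 2)[:, None]).T @ Pt / n
    Q1 = R.T @ R / n
    Sigma1 = (R * (u_hat ** 2)[:, None]).T @ R / n
    H = (Pt * dh_du[:, None]).T @ R / n          # K x L cross term between the two steps
    Q_inv = np.linalg.inv(Q)
    Q1_inv = np.linalg.inv(Q1)
    first_step_term = H @ Q1_inv @ Sigma1 @ Q1_inv @ H.T
    V_hetero = A @ Q_inv @ Sigma @ Q_inv @ A     # heteroskedasticity-consistent part
    V_total = A @ Q_inv @ (Sigma + first_step_term) @ Q_inv @ A
    return V_total, V_hetero
```

Because the first-step correction is a nonnegative definite quadratic form, `V_total` is never smaller than `V_hetero`, and the two coincide when the fitted function does not depend on $u$ (that is, when `dh_du` is identically zero).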

We will show that under certain regularity conditions $\sqrt{n}\hat{V}^{-1/2}(\hat{\theta} - \theta_0) \to_d N(0,1)$.
Thus doing inference as if $\hat{\theta}$ were distributed normally with mean $\theta_0$ and variance $\hat{V}/n$
will be a correct large sample approximation. This extends large sample inference
results of Andrews (1991) and Newey (1997) to account for an estimated first step.
However, as in this previous work, such results do not specify the convergence rate of
$a(\hat{h})$. That rate will depend on how fast $\hat{V}$ goes to infinity. To date there is still
not a complete characterization of the convergence rate of a series estimator available
in the literature, although it is known when $\sqrt{n}$-consistency occurs (see Newey, 1997).
Those results can be extended to obtain $\sqrt{n}$-consistency of functionals of the two-step
estimators developed here. Let $\|\cdot\| = (E[\tau(w)(\cdot)^2]/\bar{\tau})^{1/2}$ denote a trimmed mean-square
norm and $\mathcal{P}$ denote the set of functions that can be approximated arbitrarily well in
this norm by a linear combination of $p^K(w)$ for large enough $K$. It is well known, for
$p^K(w)$ corresponding to power series or splines, that $\mathcal{P}$ includes all additive functions
in $(x, z_1)$ and $u$ where the individual components have finite $\|\cdot\|$ norm. The
critical condition for $\sqrt{n}$-consistency is that there is $\nu(w) \in \mathcal{P}$ such that

(5.6) $a(h) = E[\tau(w)\nu(w)h(w)]$,

for any $h(w) \in \mathcal{P}$. Essentially this means that $a(h)$ can be represented as an expected
product of $h(w)$ with a function $\nu(w)$ that is in the same space as $h$. By the Riesz
representation theorem, this is the same as the functional $a(h)$ being continuous with
respect to the mean square error $E[\tau(w)(\cdot)^2]/\bar{\tau}$. In this case $\sqrt{n}(\hat{\theta} - \theta_0)$ will be
asymptotically normal, and its asymptotic variance can be given an explicit expression.

To describe the asymptotic variance in the $\sqrt{n}$-consistent case, let $\rho(z) =
E[\tau(w)\nu(w)\partial h_0(w)/\partial u' \,|\, z]$. The asymptotic variance of the estimator can then be expressed
as:

(5.7) $V = E[\tau(w)\nu(w)\nu(w)'\mathrm{Var}(y|X)] + E[\rho(z)\mathrm{Var}(x|z)\rho(z)']$.

For example, for the weighted average derivative in equation (5.2), if $v(w)$ is
zero where $\tau(w)$ is zero, so $\tau(w)v(w) = v(w)$, then integration by parts shows that
equation (5.6) is satisfied with $\nu(w) = -\mathrm{Proj}(f_0(w)^{-1}\partial v(w)/\partial x \,|\, \mathcal{P})$, where $f_0(w)$ is the
density of $w$ and $\mathrm{Proj}(\cdot \,|\, \mathcal{P})$ denotes the mean-square projection on the space of
functions that are additive in $w_1$ and $u$. The presence of $\mathrm{Proj}(\cdot \,|\, \mathcal{P})$ is necessary to
make $\nu(w)$ lie in the space spanned by $p^K(w)$ for large $K$, that is the set of
functions that are additive in $w_1$ and $u$. Then

$$\rho(z) = -E[\tau(w)\{\mathrm{Proj}(f_0(w)^{-1}\partial v(w)/\partial x \,|\, \mathcal{P})\}\partial h_0(w)/\partial u' \,|\, z],$$

and the asymptotic variance will be $V$ from equation (5.7).

To state precisely results on asymptotic normality it is helpful to introduce some
regularity conditions. Recall that $\eta = y - h_0(w)$.

Assumption 5: $\sigma^2(X) = \mathrm{Var}(y|X)$ is bounded away from zero, $E[\eta^4|X]$ is bounded, and
$E[\|u\|^4|X]$ is bounded. Also, $h_0(w)$ is twice continuously differentiable in $u$ with
bounded first and second derivatives.

This  condition  strengthens  the  bounded  conditional  second  moment  condition  to  boundedness 
of  the  fourth  conditional  moment  and  the  conditional  variance  being  bounded  away  from 
zero. 

The next assumption is the key condition for $\sqrt{n}$-consistency of the estimator. It
requires that the functional $a(h)$ have a representation as an expected product form, at
least for the truth $h_0$ and the approximating functions.

Assumption 6: There exist $\nu(w)$ and $\beta_K$ such that $E[\tau(w)\|\nu(w)\|^2] < \infty$, $a(h_0) =
E[\tau(w)\nu(w)h_0(w)]$, $a(p_{kK}) = E[\tau(w)\nu(w)p_{kK}(w)]$, and $E[\tau(w)\|\nu(w) - \beta_K p^K(w)\|^2] \to 0$ as $K \to \infty$.

This condition also requires that $\nu(w)$ can be approximated well by a linear combination
of $p^K(w)$ for large $K$, which is important for the precise form that $\nu(w)$ must take. For
example, with the average derivative this condition leads $\nu(w)$ to be minus the projection of
$f_0(w)^{-1}\partial v(w)/\partial x$ on functions that are additive in $w_1$ and $u$.

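The next condition is complementary to Assumption 6. For $d_w = d + d_u$, where $d_u$ denotes
the dimension of $u$, and a $d_w \times 1$ vector $\mu$ of nonnegative integers let $|\mu| = \sum_{j=1}^{d_w} \mu_j$ and
$\partial^\mu h(w) = \partial^{|\mu|}h(w)/\partial w_1^{\mu_1} \cdots \partial w_{d_w}^{\mu_{d_w}}$. Also let $\delta$ denote a nonnegative integer and
$|h|_\delta = \max_{|\mu| \le \delta} \sup_{w \in W} |\partial^\mu h(w)|$.

Assumption 7: $a(h)$ is a scalar, $|a(h)| \le |h|_\delta$ for some $\delta \ge 0$, and there exists $\beta_K$
such that as $K \to \infty$, $a(p^{K\prime}\beta_K)$ is bounded away from zero while $E[\tau(w)\{p^K(w)'\beta_K\}^2] \to 0$.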


This assumption says that the functional $a(h)$ is continuous in $|h|_\delta$, but not in mean
square error. The lack of mean-square continuity will imply that the estimator $a(\hat{h})$ is
not $\sqrt{n}$-consistent, and is also a useful regularity condition. Another restriction
imposed is that $a(h)$ is a scalar, which is general enough to cover many cases of
interest. When $a(h)$ is a vector, asymptotic normality with an estimated covariance
matrix would follow from Assumption J iii) of Andrews (1991). That assumption of Andrews
(1991) is difficult to verify. In contrast, Assumption 7 is a primitive condition that
is relatively easy to verify. This condition and Assumption 6 are mutually exclusive,
and together are general enough to include many cases of interest. For example, as noted above,
Assumption 6 includes the weighted average derivative. For another example, Assumption 7
obviously applies to the functional $a(h) = h(x, z_1, \bar{u}) - \bar{\lambda}$ from equation (5.1), which is
continuous with respect to the supremum norm and where it is straightforward to show that
there is $\beta_K$ such that $p^K(x, z_1, \bar{u})'\beta_K - \bar{\lambda}$ is bounded away from zero but
$E[\tau(w)\{p^K(w)'\beta_K\}^2] \to 0$.

The  next  condition  restricts  the  allowed  rates  of  growth  for     K     and     L. 

/-^^~s/d  ^        /-u  -s  /d  ^      i-  .  /,,9,         ,^8,  2       ,,6,  3 

Assumption  8:     vnK  — >  0,     vnL     i     i  — >  0,     for  power  series,      (KL  +  KL     +KL     + 

K^L^)/n  -^  0,     and  for  splines     (K^L  +  k'^L^  +  K^^L"^  +  KL'^)/n  — >  0. 


One can show that this condition requires that $s/d > 5/2$ and $s_1/d_1 > 2$, as was
pointed out by a referee. It also limits the growth rate of $K$ and $L$ so that if each
grows  at  the  same  rate  they  can  grow  no  faster  than     n  for  splines  and     n  for 

power  series.     These  conditions  are  stronger  than  for  single  step  series  estimators 
discussed  in  Newey  (1997),   because  of  complications  due  to  the  multistage  nature  of  the 
estimation  problem. 

The next condition is useful for the estimator $\hat{V}$ of the asymptotic variance to
have the right properties.

Assumption 9: Either a) $x$ is univariate, if a spline is used it is at least a
quadratic one, and $\sqrt{n}K^{1-s} \to 0$; b) $x$ is multivariate, $p^K(w)$ is a power series,
$h_0(w)$ is differentiable of all orders, there is a constant $C$ with the absolute value
of the $j$th derivative bounded above by $C(C)^j$, and $\sqrt{n}K^{-\epsilon} \to 0$ for some $\epsilon > 0$.


This assumption helps guarantee that $\lambda_0(u)$ and its derivative can be approximated by
$p^K(w)$, which is important for consistency of the covariance matrix estimator. These
conditions are not very general, but more general ones would require series approximation
rates for derivatives, which are not available in the literature. This assumption is at
least useful for showing that the asymptotic variance estimator will be consistent under
some set of regularity conditions.

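Next, for the nonnegative integer $\delta$ of Assumption 7, let $\zeta_\delta(K) =
\max_{|\mu| \le \delta} \sup_{w \in W} \|\partial^\mu p^K(w)\|$.

Theorem 5.1: If Assumptions 1-3, 5, either 6 or 7, and 8 and 9 are satisfied, then $\hat{\theta} =
\theta_0 + O_p(\zeta_\delta(K)/\sqrt{n})$ and

$$\sqrt{n}V_K^{-1/2}(\hat{\theta} - \theta_0) \to_d N(0,1), \quad \sqrt{n}\hat{V}^{-1/2}(\hat{\theta} - \theta_0) \to_d N(0,1).$$

Furthermore, if Assumption 6 is satisfied,

$$\sqrt{n}(\hat{\theta} - \theta_0) \to_d N(0,V), \quad \hat{V} \to_p V.$$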


In addition to the asymptotic normality needed for large sample confidence intervals this
result gives convergence rate bounds and, when Assumption 6 holds, $\sqrt{n}$-consistency.
Bounds on the convergence rate can also be derived from the results shown in Newey (1997)
that for power series $\zeta_\delta(K) \le CK^{1+2\delta}$ and for splines $\zeta_\delta(K) \le CK^{1/2+\delta}$, where $C$ is a
generic positive constant. Thus, for example, when $a(h) = h(x, z_1, \bar{u}) - \bar{\lambda}$, where $\delta = 0$,
an upper bound on the convergence rate for power series would be $K/\sqrt{n}$ and for splines
would be $\sqrt{K}/\sqrt{n}$.

When Assumption 9 b) is satisfied, so that $h_0(w)$ has derivatives of all orders,
and $\Pi_0(z)$ also satisfies the same condition, the convergence rate can be made close to
$1/\sqrt{n}$ by letting $K$ grow slowly enough. In particular, $h_0(w)$ and $\Pi_0(z)$ are
continuously differentiable of order $s$ for every $s > 0$, so that if $K = Cn^\epsilon$ and $L =
Cn^\epsilon$ for some positive, small $\epsilon$, all of the conditions will be satisfied. Then the
convergence rate for $a(h) = h(x, z_1, \bar{u}) - \bar{\lambda}$ for power series will be $n^{\epsilon - 1/2}$,
which will be close to $1/\sqrt{n}$ for $\epsilon$ small enough.
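
These rate bounds are simple to tabulate. The following sketch assumes the bounds take the form $\zeta_\delta(K) \le CK^{1+2\delta}$ for power series and $\zeta_\delta(K) \le CK^{1/2+\delta}$ for splines, with the generic constant set to one; the function names `zeta_bound` and `rate_bound` and the numerical values are hypothetical.

```python
def zeta_bound(K, delta, basis):
    """Upper bound on zeta_delta(K), up to a constant, for the two bases."""
    if basis == "power":
        return K ** (1.0 + 2.0 * delta)
    if basis == "spline":
        return K ** (0.5 + delta)
    raise ValueError("basis must be 'power' or 'spline'")


def rate_bound(n, K, delta, basis):
    """Bound zeta_delta(K) / sqrt(n) on the convergence rate of the functional."""
    return zeta_bound(K, delta, basis) / n ** 0.5


# Value of the function at a point (delta = 0) with K growing like n^eps.
n = 1_000_000
eps = 0.05
K = n ** eps
power_rate = rate_bound(n, K, 0, "power")    # K / sqrt(n) = n^(eps - 1/2)
spline_rate = rate_bound(n, K, 0, "spline")  # sqrt(K) / sqrt(n) = n^(eps/2 - 1/2)
```

For small `eps` both bounds are close to $1/\sqrt{n}$, with the spline bound the smaller of the two.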

The centering of the limiting distribution at zero in this result is a reflection of
"over fitting" that is present, meaning that the bias of these estimators shrinks faster
than the variance. This over fitting is implied by Assumption 8. The order of the bias
of the estimator is $K^{-s/d}$, so that Assumption 8 requires that the bias shrink faster
than $1/\sqrt{n}$. Since $1/\sqrt{n}$ is generally the fastest the standard deviation of
estimators can shrink (such as the sample mean), over fitting is present.

This feature of the asymptotic normality result is shared by other theorems for
series estimators, as in Andrews (1991) and Newey (1997). When combined with the
condition that $K$ go to infinity slower than $n^\alpha$ for $0 < \alpha < 1$, the bias shrinking
faster than $1/\sqrt{n}$ means that the convergence rate for the estimator is bounded away from
the optimal rate. This is different from asymptotic normality results for other types of
nonparametric estimators, such as kernel regression. It would be good to relax this
condition, but that is an extension that is beyond the scope of this paper.


6.        Additive  Semiparametric  Models 

In  economic  applications  there  is  often  a  large  number  of  covariates  thereby  making 
nonparametric  estimation  problematic.     The  well  known  curse  of  dimensionality  can  make 
the  estimation  of  large  dimensional  nonparametric  models  difficult.     One  approach  to  this 
problem  is  to  restrict  the  model  in  some  way  while  retaining  some  nonparametric  features. 
The  single  index  model  discussed  earlier  is  one  example  of  such  a  restricted  model. 
Another example is a model that is additive in some components and parametric in
others.     This  type  of  model  has  received  much  attention  in  the  literature  as  an  approach 
to  dimension  reduction.     Furthermore,  it  is  particularly  easy  to  estimate  using  a  series 
estimator,  by  simply  imposing  restrictions  on  the  approximating  functions. 

To describe this type of model, let $w_1 = (x, z_1)$ as before, and let $w_{1j}$,
$(j = 0, 1, \ldots, J)$, denote $J+1$ subvectors of $w_1$. A semiparametric additive model is

(6.1) $g_0(w_1) = w_{10}'\gamma_0 + \sum_{j=1}^J g_{j0}(w_{1j})$.

This  model  can  be  estimated  by  specifying  that  the  functions  of     x     included  in     p    (w) 

consist  of  the  elements  of     w         and  power  series  or  splines  in     w       (j=l,...,J).     One 

could  also  impose  similar  restrictions  on     A(u)     and  the  reduced  form,  by  specifying  that 

they  depend  on  a  linear  combination  plus  some  additive  components.     For  example,  one 

could  specify  a  partially  linear  reduced  form  as     TT„(z)  =  z'lr^  +  V       ,11    (z    ).     This 
^  0  0  0       '^m=l  m    m 

restriction  could  be  imposed  in  estimation  by  specifying     r   (z)     to  include  the  elements 

of     z„     and  power  series  or  splines  in  each     z    . 
0  m 
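
As an illustration of how the additive restriction can be imposed through the choice of approximating functions, the following is a minimal sketch in Python with NumPy. It is not the authors' code: the function name `additive_basis`, the synthetic data and the polynomial order are all assumptions made for the example. The basis contains a constant, the linear terms, and powers of each additive component, with no cross products between components.

```python
import numpy as np

def additive_basis(w0, w_list, degree):
    """Constant, the linear terms w0, then powers 1..degree of each w_j.

    No cross products between different w_j are included, which is how the
    additive restriction is imposed on the approximating functions.
    """
    n = w0.shape[0]
    cols = [np.ones((n, 1)), w0]
    for wj in w_list:
        cols.extend(wj.reshape(n, 1) ** d for d in range(1, degree + 1))
    return np.hstack(cols)

# Synthetic illustration (all numbers below are made up for the sketch).
rng = np.random.default_rng(0)
n = 500
w0 = rng.normal(size=(n, 2))             # enters linearly
w1 = rng.uniform(-1.0, 1.0, size=n)      # enters through an unknown function
w2 = rng.uniform(-1.0, 1.0, size=n)      # enters through another unknown function
gamma_true = np.array([1.0, -0.5])
y = w0 @ gamma_true + np.sin(2.0 * w1) + w2 ** 2 + 0.1 * rng.normal(size=n)

P = additive_basis(w0, [w1, w2], degree=4)
coef, *_ = np.linalg.lstsq(P, y, rcond=None)
gamma_hat = coef[1:3]                    # coefficients on the linear part
```

With two linear regressors and two additive components of order four, the basis has 11 columns rather than the much larger number a fully interacted polynomial would require.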

The coefficients $\gamma_0$ of equation (6.1) are examples of functionals of $h_0(w) = g_0(w) + \lambda_0(u)$ that are mean-square continuous functionals of $h$, so that a series estimator will be $\sqrt{n}$-consistent under the conditions of Section 5. Let $q(w)$ be the residual from the mean-square projection of $w_0$ on functions of the form $\sum_{j=1}^{J} g_j(w_j) + \lambda(u)$, for the probability measure where $P(\mathcal{A}) = E[\tau(w)1(w \in \mathcal{A})]/E[\tau(w)]$. Assume that $E[\tau(w)q(w)q(w)']$ is nonsingular, an identification condition for $\gamma_0$. Then

(6.2)  $\gamma_0 = E[\tau(w)\nu(w)h_0(w)], \qquad \nu(w) = (E[\tau(w)q(w)q(w)'])^{-1}q(w).$


Hence, $\sqrt{n}$-consistency and asymptotic normality of the series estimator for $\gamma_0$ will follow from the results of Section 5. Robinson (1988) also gave some instrumental variable estimators for a semiparametric model that is linear in endogenous variables. The regularity conditions of Section 5 can be weakened somewhat for this model. Basically, only the nonparametric part of the model need satisfy these conditions so that, for example, $w_0$ need not be continuously distributed.


7.        Empirical  Example

To illustrate our approach we investigate the empirical relationship between the hourly wage rate and the number of annual hours worked. While early examinations treated the hourly wage rate as independent of the number of hours worked, recent studies have incorporated an endogenous wage rate (see, for example, Moffitt 1984, Biddle and Zarkin 1989, and Vella 1993) and have uncovered a non-linear relationship between the wage rate and annual hours. The theoretical underpinnings of this relationship can be assigned to an increasing importance of labor when it is involved in a greater number of hours (see Oi 1962). Alternatively, Barzel (1973) argues that labor is relatively unproductive at low hours of work due to related start-up costs. Moreover, at high hours of work Barzel argues that fatigue decreases labor productivity. These forces will generate hourly wage rates that initially increase but subsequently decrease as daily hours worked increase. Finally, the taxation/labor supply literature argues that a non-linear relationship may exist due to the progressive nature of taxation rates. Thus the non-linear relationship between hours and wage rates is potentially the outcome of many influences, and we capture it in the following model:


(7.1)  $y_i = z_{1i}'\beta + g_{20}(x_i) + \varepsilon_i, \qquad x_i = z_i'\gamma + u_i,$


where $y_i$ is the log of the hourly wage rate of individual $i$, $z_{1i}$ is a vector of individual characteristics, $x_i$ is annual hours worked, $z_i$ is a vector of exogenous variables that includes $z_{1i}$, $\beta$ and $\gamma$ are parameters, $g_{20}$ is an unknown function, and $\varepsilon_i$ and $u_i$ are zero mean error terms such that $E[\varepsilon|u] \neq 0$. This is a semiparametric version of equation (1.1), like that discussed in Section 6, with a parametric reduced form. We use this specification because there are many individual characteristics in $z_{1i}$, so that it would be difficult to apply fully nonparametric estimation due to the curse of dimensionality. We estimate this model using data on males from the 1989 wave of the Michigan Panel Study of Income Dynamics. To preserve comparability with previous empirical studies we follow similar data exclusions to those in Biddle and Zarkin (1989). Accordingly, we only include males aged between 22 and 55 who had worked between 1000 and 3500 hours in the previous year. This produced a sample of 1314 observations. Our measures of hours and wages are annual hours worked and the hourly wage rate respectively.

We first estimate the wage equation by linear OLS and report the relevant parameters in column 1 of Table 1. The hours effect is small and not significantly different from zero. Adjusting for the endogeneity of hours via linear two stage least squares (2SLS), however, indicates that the impact of hours is statistically significant although the effect is small. The results in column 2, on the basis of the t-statistic for the residuals, suggest hours are endogenous to wages.

To allow the $g_{20}$ function to be non-linear we employed alternative specifications for $g_{20}$ and various approximations for the manner in which we account for the endogeneity of hours. We also allow for some non-linearities in the reduced form by including non-linear terms for the experience and tenure variables. We first choose the number of non-linear terms in the first step and then, by employing the associated residuals from the chosen specification, we determine the number of approximating terms in the primary equation. We discriminate between alternative specifications on the basis of the cross validation (CV) criterion. The CV criterion is well known to minimize asymptotic mean-square error when the first estimation step is not present, so we are hopeful that it will lead to estimates with good properties here. Below we consider its properties in a Monte Carlo study, and find that CV performs well.
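
The two steps just described can be sketched as follows. This is only an illustrative Python/NumPy sketch on synthetic data, not the code used for the results in this section: the function names `powers` and `two_step`, the linear reduced form, the data generating process and the polynomial orders are assumptions chosen so that the effect of including the first-step residual is easy to see.

```python
import numpy as np

def powers(v, degree):
    """Columns v, v**2, ..., v**degree (no constant)."""
    return np.column_stack([v ** d for d in range(1, degree + 1)])

def two_step(y, x, z1, z, deg_x, deg_u):
    """Two-step series estimator with a linear reduced form.

    Step 1: regress x on (1, z) and keep the residual u_hat.
    Step 2: regress y on (1, z1, powers of x, powers of u_hat).
    Returns the second-step coefficients and u_hat.
    """
    n = len(y)
    const = np.ones((n, 1))
    R = np.hstack([const, z])
    pi_hat, *_ = np.linalg.lstsq(R, x, rcond=None)
    u_hat = x - R @ pi_hat
    P = np.hstack([const, z1, powers(x, deg_x), powers(u_hat, deg_u)])
    coef, *_ = np.linalg.lstsq(P, y, rcond=None)
    return coef, u_hat

# Synthetic data with an endogenous regressor (all values are illustrative).
rng = np.random.default_rng(1)
n = 2000
z1 = rng.normal(size=(n, 1))
excluded = rng.normal(size=(n, 1))        # instrument excluded from the primary equation
z = np.hstack([z1, excluded])
u = rng.normal(size=n)
x = z1[:, 0] + excluded[:, 0] + u
eps = 0.8 * u + 0.3 * rng.normal(size=n)  # E[eps | u] = 0.8 u, so x is endogenous
y = 0.5 * z1[:, 0] + x - 0.3 * x ** 2 + eps

coef, u_hat = two_step(y, x, z1, z, deg_x=2, deg_u=1)
b_x_two_step = coef[2]                    # coefficient on x (true value 1.0)

# Naive series regression that ignores the residual term.
P_naive = np.hstack([np.ones((n, 1)), z1, powers(x, 2)])
b_x_naive = np.linalg.lstsq(P_naive, y, rcond=None)[0][2]
```

In this design the naive regression overstates the linear coefficient on `x`, while the regression that includes `u_hat` recovers it, which mirrors the role of the residual terms in Table 1.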

Because the theory requires over fitting, where the bias is smaller than the variance asymptotically, it seems prudent to consider specifications with more terms than those that CV gives, particularly for inference. Accordingly, for both steps we identify the number of terms that minimizes the CV criterion and then add an additional term. For example, while the approximation which minimized the CV criterion for the first step was a fourth order polynomial in tenure and experience, we generate the residuals from a model which includes a fifth order polynomial in these variables. Note that, to enable comparisons, these were the residuals which were included in column 2 discussed above.
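
The selection rule of "CV-minimizing order plus one" can be sketched in a few lines. The sketch below assumes that the CV criterion is the sum of squared leave-one-out prediction errors, which is a common definition but is not spelled out in the text; the function names `loo_cv` and `choose_order` and the synthetic data are likewise illustrative assumptions.

```python
import numpy as np

def loo_cv(X, y):
    """Sum of squared leave-one-out prediction errors for OLS of y on X.

    Uses the identity e_i / (1 - h_ii), where h_ii is the i-th leverage,
    so the regression is fitted only once.
    """
    Q, _ = np.linalg.qr(X)                 # orthonormal basis for the columns of X
    leverage = np.sum(Q ** 2, axis=1)      # diagonal of the hat matrix
    resid = y - Q @ (Q.T @ y)
    return float(np.sum((resid / (1.0 - leverage)) ** 2))

def choose_order(x, y, max_order):
    """Return (CV-minimizing order, that order plus one, all CV values)."""
    cv = {d: loo_cv(np.vander(x, d + 1, increasing=True), y)
          for d in range(1, max_order + 1)}
    d_cv = min(cv, key=cv.get)
    return d_cv, d_cv + 1, cv

# Illustrative data: a cubic signal plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=200)
y = 1.0 + x - 2.0 * x ** 3 + 0.2 * rng.normal(size=200)
d_cv, d_used, cv_values = choose_order(x, y, max_order=6)
```

Here `d_used` plays the role of the over fitted specification: it is the order actually carried forward to estimation and inference.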

The  CV  values  for  a  subset  of  the  second  step  specifications  examined  are  reported 
in  Table  2.     Although  we  also  experimented  with  splines,  in  each  step,   we  found  that  the 
specifications  involving  polynomial  approximations  appeared  to  dominate.     The  preferred 
specification  is  a  fourth  order  polynomial  in  hours  while  accounting  for  the  endogeneity 
through  a  third  order  polynomial  in  the  residuals.     The  CV  criterion  for  this 
specification  is     182.643.     The  fourth  column  of  Table  1  reports  the  estimates  of  the 
hours  profile  employing  these  approximations.     The  third  column  presents  the  hours 
coefficients  while  excluding  the  residuals.      Due  to  the  over  fitting  requirement  we  now 
take  as  a  base  case  the  specification  with  a  fifth  order  polynomial  in  hours  and  a  fourth 
order  polynomial  in  the  residual.     We  will  also  check  the  sensitivity  of  some  results  to 
this  choice. 

While Table 1 suggests the non-linearity and endogeneity are statistically important, it is useful to examine their respective roles in determining the hours profile. For comparison purposes Figure 1 plots the change in the log of the hourly wage rate as annual hours increase, where the nonparametric specification is the over fitted case with $h = 5$ and $r = 4$, the orders of the polynomials in hours and in the residual respectively. To facilitate comparisons each function has been centered at its mean over the observations, and the graph shows the deviation of the function from its mean.
As  annual  hours  affect  the  log  of  the  hourly  wage  rate  in  an  additive  manner  we  simply 
plot  the  impact  of  hours  on  the  log  of  the  hourly  wage  rate.     In  Figure  1  we  plot  the 
quadratic  2SLS  estimates  like  those  of  Biddle  and  Zarkin  (1989)  and  the  profile  is  very 
similar  to  theirs.     In  Figure  1  we  also  plot  the  predicted  relationship  between  hours  and 
wages  for  these  data  with  and  without  the  adjustment  for  endogeneity  using  our  base 
specification.     It  indicates  the  relationship  is  highly  non-linear.     Furthermore,  failing 
to  adjust  for  the  endogeneity  leads  to  incorrect  inferences  regarding  the  overall  impact 
of  hours  on  wages  and,  more  specifically,  the  value  of  the  turning  points. 

Although we found that the polynomial approximations appeared to best fit the data, we also identified the best fitting spline approximations. We employ cubic splines, and the parameter we allow to be chosen by the data is the number of join points. We do this for the first and second steps, and in this instance we over fit the data by including an additional join point, in each step, above what is determined on the basis of the CV values. Note that the best fitting approximation involved 1 join point in the reduced form and a primary specification of 1 join point in the hours function and 1 join point for the residuals. In Figure 2 we plot the predicted relationship between hours and the wage rate from the over fitted spline approximation. This predicted relationship looks remarkably similar to that in Figure 1. Furthermore, the points noted above related to the failure to account for the endogeneity are similar.

Figures  1  and  2  indicate  that  at  high  numbers  of  hours  the  implied  decrease  in  the 
wage  rate  is  sufficient  to  decrease  total  earnings.     This  may  be  due  to  the  small  number 
of  observations  in  this  area  of  the  profile  and  the  associated  large  standard  errors. 
Figures 3 and 4 explore this issue by plotting the 95 percent confidence intervals for our adjusted nonparametric estimates of the wage hours profile. They confirm that the estimated standard errors are large at the upper end of the hours range.

To quantify an average measure of the effect of hours on wages we consider the weighted average derivative of the estimated $g_{20}$ function. We estimated the derivative over the range where the function is shown to be upward sloping in Figures 1 and 2. This
region,     1300     to     2500     hours,  represents     81  percent     of  the  total  sample.     There  were 
only     125     observations  above     2500,     so  that  not  very  many  were  excluded  by  ignoring  the 
upper  tail.     The  average  derivative  for  the     1300-2500     part  of  the  hours  profile  from 
the  polynomial  approximation,  based  on     h  =  5     and     r  =  4,   is     .000513     with  a  standard 
error  of     .000155.     The  corresponding  estimate  from  our  spline  approximation  of  the 
weighted  average  derivative  is     .000505     with  a  standard  error  of     .000147.     These 
estimates are quite different from the linear 2SLS estimate reported in Table 1, which is consistent with the presence of nonlinearity. Furthermore, for the quadratic 2SLS in Figure 1 the average derivative is .000350 with a standard error of .000142, which is also quite different from the average derivative for the nonparametric estimators. This difference is present despite the relatively wide pointwise confidence bands in Figures 3 and 4. Thus, in our example the average wage change for most of the individuals in the sample is found to be substantially larger for our nonparametric approach than by linear or quadratic 2SLS.
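
Because the average derivative of a polynomial approximation is linear in the estimated coefficients, both the estimate and a standard error are simple to compute. The sketch below illustrates this on synthetic data. It is a simplification under stated assumptions: it uses an unweighted average over observations in a range, a single-step polynomial regression, and a heteroskedasticity-robust covariance matrix, so it ignores the first-step correction that the variance estimator of Section 5 includes. The function name `average_derivative` is hypothetical.

```python
import numpy as np

def average_derivative(x, y, order, lo, hi):
    """Average derivative of a fitted polynomial over observations in [lo, hi].

    Returns the estimate and a heteroskedasticity-robust standard error.
    The average derivative is linear in the coefficients, a = A'beta, so its
    variance is A'VA.
    """
    X = np.vander(x, order + 1, increasing=True)      # 1, x, ..., x**order
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    V = XtX_inv @ ((X.T * resid ** 2) @ X) @ XtX_inv  # sandwich covariance of beta
    xs = x[(x >= lo) & (x <= hi)]
    # Derivative of each basis function: 0, 1, 2x, ..., order * x**(order-1)
    D = np.column_stack([np.zeros_like(xs)] +
                        [d * xs ** (d - 1) for d in range(1, order + 1)])
    A = D.mean(axis=0)
    return float(A @ beta), float(np.sqrt(A @ V @ A))

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=2000)
y = 2.0 * x + x ** 2 + 0.1 * rng.normal(size=2000)    # true derivative is 2 + 2x
est, se = average_derivative(x, y, order=3, lo=0.2, hi=0.8)
```

Restricting the average to a range, as with the 1300 to 2500 hours region above, only changes which observations enter the vector `A`.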

Before proceeding we examined the robustness of these estimates of the average derivative to alternative specifications of the reduced form and alternative approximations of the wage hours profile. First, we reduced the number of approximating terms in the reduced form. While we found that the CV criteria for each step increased with the exclusion of these terms in the reduced form, we found that there was virtually no impact on the total profile and a very minor effect on our estimate of the average derivative of the profile for the region discussed above. For example, consider the estimates from the polynomial approximation. When we included only linear terms in the reduced form the estimated average derivative is .000515 with a standard error of .0000175. The corresponding values for the specification including quadratic terms are .000546 with a standard error of .000174. Finally, the addition of cubic terms resulted in an estimate of .000522 with a standard error of .000153. Changes in the specification of the model based on the spline approximations produced similarly small differences.

We  also  explored  the  sensitivity  of  the  average  derivative  estimate  to  the  number  of 
terms  in  the  series  approximation.     Both  the  average  derivative  and  its  standard  error 
were  relatively  unaffected  by  increasing  the  number  of  terms  in  the  approximations.     For 
example,  with  an  8th  order  polynomial  in  hours  and  7th  order  polynomial  in  the  residual 
the  average  derivative  estimate  was     .000519     with  a  standard  error  of     .000156. 

Finally  we  examine  whether  we  are  able  to  reject  the  additive  structure  implied  by 
our  model.     We  do  this  by  including  an  interaction  term  capturing  the  product  of  hours 
and  the  residuals  in  our  preferred  specification.     The  t-statistic  on  this  included 
variable is .231. The CV value for this specification is 182.924 while that for our preferred specification plus this interaction term and the term capturing the product of hours squared and the residual squared is 183.188. On the basis of these results we conclude there is no evidence against the additive structure of the model.

In  addition  to  this  empirical  evidence  we  provide  simulation  evidence  featuring  the 
data  examined  above.     The  objective  of  this  exercise  is  to  explore  the  ability  of  our 
procedure  to  accurately  estimate  the  unknown  function  in  the  empirical  setting  we 
examined  above.     We  simulate  the  endogenous  variable  through  the  exogenous  variables  and 
parameter  estimates  from  each  of  the  polynomial  and  spline  approximations  reported  above. 
We  generate  the  hours  variable  by  simulating  the  reduced  form  and  incorporating  a  random 
component  drawn  from  the  reduced  form  empirical  residual  vector.     From  this  reduced  form 
we  estimate  the  residuals  and  from  the  simulated  hours  vector  we  generate  the  higher 
order  terms  for  hours.     We  then  simulate  wages  by  employing  the  simulated  values  of  hours 
and  residuals,   along  with  the  true  exogenous  variables,   and  the  parameter  estimates 
discussed  above.     The  random  component  is  drawn  from  the  distribution  of  the  empirical 
wage equation residuals. As we employ the parameter vector from the over fitted models in the empirical example, the polynomial model has a fifth order polynomial in the reduced form while the wage equation has a fifth order polynomial in hours and a fourth order polynomial in the residuals. The spline approximation employs a cubic spline with 2 join points
in  the  reduced  form  while  the  wage  equation  has     2     join  points  for  hours  and     2     join 
points  for  the  residuals.     We  examine  the  performance  of  our  procedure  for     3000 
replications  of  the  model. 
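
One replication of a simulation design of this kind can be sketched as follows. The sketch is an interpretation of the description above rather than the authors' program: the function name `simulate_dataset`, its arguments, and the stand-in parameter values are assumptions, and the simulated reduced-form draw is used directly as the residual that enters the wage equation.

```python
import numpy as np

def simulate_dataset(rng, Z, z1, pi_hat, u_hat, beta_z1, b_x, b_u, e_hat):
    """One simulated sample, holding the exogenous variables fixed.

    Z       : reduced-form regressors (including a constant)
    z1      : exogenous regressors of the primary equation
    pi_hat  : reduced-form coefficient estimates
    u_hat   : empirical reduced-form residuals (resampled with replacement)
    beta_z1 : coefficients on z1
    b_x,b_u : polynomial coefficients on x, x**2, ... and u, u**2, ...
    e_hat   : empirical primary-equation residuals (resampled with replacement)
    """
    n = Z.shape[0]
    u_star = rng.choice(u_hat, size=n, replace=True)
    x_star = Z @ pi_hat + u_star                       # simulated endogenous variable
    e_star = rng.choice(e_hat, size=n, replace=True)
    g_x = sum(b * x_star ** (k + 1) for k, b in enumerate(b_x))
    lam_u = sum(b * u_star ** (k + 1) for k, b in enumerate(b_u))
    y_star = z1 @ beta_z1 + g_x + lam_u + e_star
    return y_star, x_star, u_star

# Stand-in inputs; in an application these would come from the fitted model.
rng = np.random.default_rng(4)
n = 300
z1 = rng.normal(size=(n, 1))
Z = np.hstack([np.ones((n, 1)), z1, rng.normal(size=(n, 1))])
pi_hat = np.array([0.0, 1.0, 1.0])
u_hat = rng.normal(size=n)
e_hat = 0.3 * rng.normal(size=n)
y_star, x_star, u_star = simulate_dataset(
    rng, Z, z1, pi_hat, u_hat,
    beta_z1=np.array([0.5]), b_x=[1.0, -0.3], b_u=[0.8], e_hat=e_hat)
```

The two residual vectors are resampled independently, so the endogeneity of the simulated regressor comes entirely through the polynomial in `u_star`.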

In  order  to  relax  the  parametric  assumptions  of  the  model  we  use  the  CV  criterion  in 
each  step  of  the  estimation.     That  is,  for  the  model  which  employs  the  polynomial 
approximation we first choose the number of terms in the reduced form. Then, on the basis of the residuals from the over fit reduced form, we choose the number of approximating terms in the wage equation. For the model generated by the spline approximation we do the same except we employ a cubic spline and choose the number of join points. For both approximations we trim the bottom and top two and a half percent of observations, on the basis of the hours residuals, from the data for which we estimate the wage equation.

An important aspect of our procedure is the ability of the CV method to identify the correct specification. To examine this we computed the CV criteria for the polynomial model for all 25 specifications of the wage equation combining up to a fifth order polynomial in hours and a fifth order polynomial in residuals. As there were also 5 choices in the reduced form this generated 125 possibilities. For the spline model we examined up to 3 join points in the reduced form and then 3 each for hours and residuals in the wage equation. This produced 27 possible specifications.

From  the  Monte  Carlo  results  we  can  evaluate  the  performance  of  an  average 
derivative  estimator  like  that  we  applied  to  the  actual  data.     Once  again  we  over  fit  the 
data  such  that  we  computed  the  estimator  by  choosing  the  number  of  terms  that  minimizes 
the  CV  criterion,  plus  one  additional  term,  and  computed  standard  errors  for  the  same 
specification.     First  consider  the  results  from  the  polynomial  model.     The  mean  of  the 
average  derivative  estimates,   across  the  Monte  Carlo  replications,   was     .000456,     which 
represents  a  bias  of     8.9  percent.     The  standard  error  across  the  replications  was 
.000153,     while  the  average  of  the  estimated  standard  errors  was     .000152.     Therefore  the 
estimated  standard  errors  accurately  reflect  the  variability  of  the  estimator  in  this 
experiment.  For  the  spline  approximation  the  average  estimate  was     .000442     which 
represents  a  bias  of     11.1  percent.     The  standard  error  across  the  replications  was 
.000145,     while  the  average  of  the  estimated  standard  errors  was     .000144. 

We  also  computed  rejection  frequencies  for  tests  that  the  average  derivative  was 
equal  to  its  true  value.   For  the  polynomial  model  the  rejection  rates  at  the  nominal  5, 
10     and     20     percent  significance  levels  were     6.5,   11.8     and     22.5  percent 
respectively.     The  corresponding  figures  for  the  spline  model  were     7.4,     13.1     and     24.0 
percent.     Thus,   asymptotic  inference  procedures  turned  out  to  be  quite  accurate  in  this 
experiment,   lending  some  credence  to  our  inference  in  the  empirical  example. 
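
The summary statistics reported for the Monte Carlo study (the spread of the estimates across replications, the average estimated standard error, and rejection frequencies at nominal levels) can be computed as in the following sketch. The toy estimator here is a sample mean rather than the average derivative estimator, and the function name `rejection_rates`, the number of replications and the sample size are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def rejection_rates(estimates, std_errors, true_value, levels=(0.05, 0.10, 0.20)):
    """Share of replications in which |t| exceeds the two-sided normal critical value."""
    t = np.abs((np.asarray(estimates) - true_value) / np.asarray(std_errors))
    return {a: float(np.mean(t > NormalDist().inv_cdf(1.0 - a / 2.0))) for a in levels}

# Toy Monte Carlo: the estimator is a sample mean with its usual standard error.
rng = np.random.default_rng(5)
reps, n = 2000, 100
draws = rng.normal(loc=1.0, scale=2.0, size=(reps, n))
est = draws.mean(axis=1)
se = draws.std(axis=1, ddof=1) / np.sqrt(n)
rates = rejection_rates(est, se, true_value=1.0)
mc_sd = est.std(ddof=1)        # spread of the estimates across replications
avg_se = se.mean()             # average estimated standard error
```

Comparing `mc_sd` with `avg_se`, and `rates` with the nominal levels, is the same pair of checks as reported above for the polynomial and spline models.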


Appendix


We  will  first  prove  one  of  the  results  on  identification. 


Proof of Theorem 2.3: Consider differentiable functions $\delta(x, z_1)$ and $\gamma(u)$, and suppose that


$0 = \delta(x, z_1) + \gamma(u) = \delta(\Pi(z) + u, z_1) + \gamma(u),$


identically in $z$ and $u$. Differentiating then gives


$0 = \Pi_1(z)'\delta_x(x, z_1) + \delta_1(x, z_1), \qquad 0 = \delta_x(x, z_1) + \gamma_u(u), \qquad 0 = \Pi_2(z)'\delta_x(x, z_1).$


If $\Pi_2(z)$ is full rank it follows from the last equation that $\delta_x(x, z_1) = 0$. It then follows from the first two equations that $\delta_1(x, z_1) = 0$ and $\gamma_u(u) = 0$. Now, to show identification we will use Theorem 2.1. By differentiability of $g_0$ and $\lambda_0$ it suffices to consider $\delta(x, z_1)$ and $\gamma(u)$ that are differentiable. Also, if $\delta(x, z_1) + \gamma(u) = 0$ with probability one, then by continuity of the functions and the boundary having zero probability, this equality holds identically on the interior of the support of $(u, z)$. Then, as above, all the partial derivatives of $\delta(x, z_1)$ and $\gamma(u)$ are zero with probability one, implying they are constant. Identification then follows by Theorem 2.1. QED.


To avoid repetition, it is useful to prove consistency and asymptotic normality lemmas for a general two-step weighted least squares estimator. To state the lemmas some notation is needed. Throughout the appendix $C$ will denote a generic positive constant that may be different in different uses. Let $X$ be a vector of variables that includes $x$ and $z$ among its components and $w(X, \pi)$ a vector of functions of $X$ and $\pi$, where $\pi$ represents a possible value of $\Pi_0(z) = E[x|z]$. Then for $w = w(X, \Pi_0(z))$, it will be assumed that

(A.1)  $E[y|X] = h_0(w), \qquad E[x|z] = \Pi_0(z).$


Also, let $\psi_i = \psi(X_i)$ be a scalar, nonnegative function of $X_i$, $\hat\Pi(z)$ and $\hat\tau_i$ be as in the body of the paper, and

(A.2)  $\hat h(w) = p^K(w)'\hat\beta, \qquad \hat\beta = (\hat P'\hat P)^{-1}\hat P'Y, \qquad \hat P = [\hat\tau_1\psi_1\hat p_1, \ldots, \hat\tau_n\psi_n\hat p_n]', \qquad \hat p_i = p^K(\hat w_i).$

Let $W = \{w : \tau(w) = 1\}$, and for $d$ a nonnegative integer, let $|h|_d = \max_{|\mu| \le d}\sup_{w \in W}|\partial^\mu h(w)|$.

Assumption A1: i) $\psi(X)$ is bounded and $w(X, \pi)$ is Lipschitz in $\pi$; ii) each component $w_j(X, \pi)$ either does not depend on $\pi$ or $w_j(X, \Pi_0(z))$ is continuously distributed with bounded density; iii) For each $K$ and $L$ there are nonsingular matrices $B$ and $\tilde B$ such that for $P^K(w) = Bp^K(w)$ and $R^L(z) = \tilde Br^L(z)$, $E[\tau(w)\psi(X)^2P^K(w)P^K(w)']$ and $E[R^L(z)R^L(z)']$ have smallest eigenvalues that are bounded away from zero, uniformly in $K$ and $L$; iv) For each nonnegative integer $d$ there is $\zeta_d(K)$ with $\max_{|\mu| \le d}\sup_{w \in W}\|\partial^\mu P^K(w)\| \le \zeta_d(K)$ and $\sup_z\|R^L(z)\| \le \xi(L)$; v) There exist $d$, $\alpha$, $\alpha_1 > 0$ and $\beta_K$ and $\gamma_L$ such that $|h_0 - p^{K\prime}\beta_K|_d \le CK^{-\alpha}$ and $\sup_z\|\Pi_0(z) - \gamma_L'r^L(z)\| \le CL^{-\alpha_1}$.

2 

Lemma A1: If Assumptions 1 and A1 are satisfied for $d = 0$, $\zeta_0(K)^2K/n \to 0$, and $[K^{1/2}\zeta_1(K) + \zeta_0(K)^2\xi(L)][(L/n)^{1/2} + L^{-\alpha_1}] \to 0$, then

$\int\tau(w)\psi(X)^2[\hat h(w) - h_0(w)]^2\,dF_0(X) = O_p(K/n + K^{-2\alpha} + L/n + L^{-2\alpha_1}).$

If Assumptions 1 and A1 are satisfied for some nonnegative integer $d$, then

$|\hat h - h_0|_d = O_p(\zeta_d(K)[(K/n)^{1/2} + K^{-\alpha} + (L/n)^{1/2} + L^{-\alpha_1}]).$
u  a         p    a 


Proof of Lemma A1: Note that the value of $\hat h$ is unchanged if a nonsingular constant linear transformation of $p^K(w)$ and/or $r^L(z)$ is used, so that it can be assumed that $p^K(w) = P^K(w)$ and $r^L(z) = R^L(z)$. Let $\tau_i = \tau(w_i)$, $p_i = p^K(w_i)$, $Q = E[\tau_i\psi_i^2p_ip_i']$. By Assumption A1, the smallest eigenvalue of $Q$ is bounded away from zero, so that the largest eigenvalue of $Q^{-1}$ is bounded. Consider replacing $p^K(w)$ by $\tilde p^K(w) = Q^{-1/2}p^K(w)$. By the Cauchy-Schwarz inequality there is a constant $C$ such that $\|\partial^\mu\tilde p^K(w)\| \le C\zeta_{|\mu|}(K)$, so that the hypotheses and the conclusion will be satisfied with this replacement. Since $E[\tau(w)\psi(X)^2\tilde p^K(w)\tilde p^K(w)'] = I$ by construction, it suffices to prove the results with $Q = I$. By analogous reasoning, it suffices to prove the result for $Q_1 = E[r^L(z)r^L(z)'] = I$.

Next, let $\Delta_\pi = L^{1/2}/\sqrt{n} + L^{-\alpha_1}$, $\hat\Pi_i = \hat\Pi(z_i)$, and $\Pi_i = \Pi_0(z_i)$. Also, suppose for notational simplicity that $\Pi(z)$ is a scalar function. Note that for $\gamma_L$ from Assumption A1,

$$\begin{aligned}
\sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n &\le C(\hat\gamma - \gamma_L)'\hat Q_1(\hat\gamma - \gamma_L) + C\sum_{i=1}^n\|\Pi_i - \gamma_L'r^L(z_i)\|^2/n \\
&\le C\int[\hat\Pi(z) - \gamma_L'r^L(z)]^2\,dF(z) + C|(\hat\gamma - \gamma_L)'(\hat Q_1 - I)(\hat\gamma - \gamma_L)| + O(L^{-2\alpha_1}) \\
&\le C\int[\hat\Pi(z) - \Pi_0(z)]^2\,dF(z) + C\int[\Pi_0(z) - \gamma_L'r^L(z)]^2\,dF(z) + C\|\hat\gamma - \gamma_L\|^2o_p(1) + O(L^{-2\alpha_1}) \\
&\le O_p(\Delta_\pi^2) + C\|\hat\gamma - \gamma_L\|^2o_p(1),
\end{aligned}$$

where the last inequality follows by Theorem 1 of Newey (1997). Also, by eq. (A.2) of the Appendix in Newey (1997), it follows that $\|\hat\gamma - \gamma_L\|^2 = O_p(\Delta_\pi^2)$, so that by eq. (A.7a),

(A.3)  $\sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n = O_p(\Delta_\pi^2).$


Also, by Theorem 1 of Newey (1997),


(A.3a)  $\max_{i \le n}\|\hat\Pi_i - \Pi_i\| = O_p(\xi(L)\Delta_\pi).$


Therefore, by Assumption A1 ii) and Lemma A3,

(A.4)  $\sum_{i=1}^n|\hat\tau_i - \tau_i|/n = O_p(\xi(L)\Delta_\pi).$


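Let $\tilde P = [\tau_1\psi_1p_1, \ldots, \tau_n\psi_np_n]'$ and $\tilde Q = \tilde P'\tilde P/n$. Note that $E[\|\tilde Q - I\|^2] \le C\zeta_0(K)^2K/n \to 0$, implying $\|\tilde Q - I\| \xrightarrow{p} 0$. Also, by $W$ convex and a mean value expansion in $w$, and $w(X, \pi)$ Lipschitz in $\pi$, $\hat\tau_i\tau_i\|\hat p_i - p_i\| \le C\zeta_1(K)\|\hat w_i - w_i\| \le C\zeta_1(K)\|\hat\Pi_i - \Pi_i\|$. Also, by the Markov inequality, $\sum_{i=1}^n\tau_i\psi_i^2\|p_i\|^2/n = O_p(K)$. Then by $\hat\tau_i^2 = \hat\tau_i$, $\tau_i^2 = \tau_i$, and $\psi_i$ bounded,

(A.5)
$$\begin{aligned}
\|\hat Q - \tilde Q\| &\le \Big\|\sum_{i=1}^n\hat\tau_i\tau_i\psi_i^2(\hat p_i\hat p_i' - p_ip_i')/n\Big\| + C\Big\|\sum_{i=1}^n(\hat\tau_i\tau_i - \hat\tau_i)\hat p_i\hat p_i'/n\Big\| + C\Big\|\sum_{i=1}^n(\hat\tau_i\tau_i - \tau_i)p_ip_i'/n\Big\| \\
&\le C\sum_{i=1}^n\hat\tau_i\tau_i\big(\|\hat p_i - p_i\|^2 + 2\psi_i\|p_i\|\,\|\hat p_i - p_i\|\big)/n + C\zeta_0(K)^2\sum_{i=1}^n|\hat\tau_i - \tau_i|/n \\
&\le C\zeta_1(K)^2\sum_{i=1}^n\|\hat\Pi_i - \Pi_i\|^2/n + C\Big(\sum_{i=1}^n\tau_i\psi_i^2\|p_i\|^2/n\Big)^{1/2}\Big(\sum_{i=1}^n\hat\tau_i\tau_i\|\hat p_i - p_i\|^2/n\Big)^{1/2} + O_p(\zeta_0(K)^2\xi(L)\Delta_\pi) \\
&= O_p\big(\zeta_1(K)^2\Delta_\pi^2 + K^{1/2}\zeta_1(K)\Delta_\pi + \zeta_0(K)^2\xi(L)\Delta_\pi\big) = o_p(1).
\end{aligned}$$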

Let $\eta_i = \psi_i[y_i - h_0(w_i)]$, $\eta = (\eta_1, \ldots, \eta_n)'$, and $X = (X_1, \ldots, X_n)$. Then by independence of the observations $E[\psi_iy_i|X] = \psi_iE[y_i|X_i] = \psi_ih_0(w_i)$, so $E[\eta_i|X] = 0$. Furthermore, $E[\eta_i^2|X]$ is bounded by $\psi_i$ and $\mathrm{Var}(y_i|X) = \mathrm{Var}(y|X_i)$ bounded, and by independence of the observations, $E[\eta_i\eta_j|X] = E[\eta_i\eta_j|X_i, X_j] = E[\eta_iE[\eta_j|\eta_i, X_i, X_j]|X_i, X_j] = E[\eta_iE[\eta_j|X_j]|X_i, X_j] = 0$ for $i \neq j$. Then by $\hat P$ and $\tilde P$ depending only on $X$,

(A.6)
$$\begin{aligned}
E[\|(\hat P - \tilde P)'\eta/n\|^2\,|\,X] &\le Cn^{-1}\sum_{i=1}^n\psi_i^2\|\hat\tau_i\hat p_i - \tau_ip_i\|^2/n \\
&\le Cn^{-1}\sum_{i=1}^n|\hat\tau_i - \tau_i|(\|\hat p_i\|^2 + \|p_i\|^2)/n + Cn^{-1}\sum_{i=1}^n\hat\tau_i\tau_i\|\hat p_i - p_i\|^2/n \\
&= O_p\big(n^{-1}[\zeta_0(K)^2\xi(L)\Delta_\pi + \zeta_1(K)^2\Delta_\pi^2]\big) = o_p(n^{-1}),
\end{aligned}$$

where the second to last equality follows similarly to eq. (A.5). Then, since a standard result is that $Y_n = O_p(\Delta_n)$ if $E[|Y_n|\,|X] = O_p(\Delta_n)$, it follows that $\|(\hat P - \tilde P)'\eta/n\|^2 = o_p(1/n)$. Also, $E[\|\tilde P'\eta/n\|^2] = E[E[\|\tilde P'\eta\|^2/n^2|X]] = E[\sum_{i=1}^n\tau_i^2\psi_i^4p_i'p_i\mathrm{Var}(y_i|X_i)/n^2] \le CE[\tau_i\psi_i^2p_i'p_i]/n = C\,\mathrm{tr}(Q)/n = C\,\mathrm{tr}(I)/n = CK/n$. Therefore, by the triangle inequality,

(A.7)  $\|\hat P'\eta/n\|^2 \le C[\|(\hat P - \tilde P)'\eta/n\|^2 + \|\tilde P'\eta/n\|^2] = o_p(1/n) + O_p(K/n) = O_p(K/n).$

Then  by  eqs.   (A.5)  and  (A.7)  and  the  smallest  eigenvalue  of     Q     bounded  away  from  zero 
with  probability  approaching  one,  for     M  =  P(P'P)~  P' ,     T)'MT)/n  =  (T)'P/n)Q~  (P'Vn)  ^ 
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0  (DIIP'Tj/nll^  =  O  (K/n).     Let     h.  =  x .\jj Mw .) ,     h.  =  x.iJj.hAw.),     h.  =  T.0.h_(w.),     and 
p  p  1  111  1  11  0      1  1  1   1  0      1 

h,     h,     and     h     be  jthe  corresponding     n  x  1     vectors  (i.e.     h  =  (h  ,...,h   )').     Let     (B 

K  -oc  ~ 

be  such  that     sup,,,|h^(w)-p   (w)'p|    =  0(K     ).     Then  by     M     and     M-I     idempotent,   (M-I)h 

o  o 

(M-I)(h-Pe),     h^<w)     Lipschitz.  and     J]-   Jin--IT.«   /n  =  O  (A   ),     for     A^,  =  VK/Vn  +  K~"  + 

0  1=1       11  P      TT  h 

Hh-h«^/n  ^  mv  +  M(h-h)  +  <M-i)hll^/n  £  C(T]'MT)/n  +  y;.",T.i/»^[h^(w.)-h„(w.)]^/n 

^1=1   1    1     0      1       0      1 

+  y.",T.^A^[h„(w.)-p'.p]^/n  £  0  (K/n)  +  C^.",  Ilfi.-n.ll^/n  +  0(K"^")  -  O  (A?). 
^1=1  11     Gil  p  ^1=1     11  p    h 


It  then  follows  by  the  smallest  eigenvalue  of     Q     bounded  av^ay  from  zero  with  probability 
approaching  one  that,  for     h  =  P/3     and     y  =  (x  i/»  y  ,...,t  \p  y  )', 

llp-/3ll^  s  0  (l)(p-p)'Q(p-/3)  =  0  (l)(y-h)'M(y-h)/n  ^  0  (l)[T)'M7]/n  +  (h-h)'M(h-h) 

+  (h-E3'M(h-h)]/n  s  0  (K/d)  +  0  (DY."  £.i//^[h^(w.)-h„(w.)]^/n 

p  p     ^^1=1  r  1     0     1      0     1 


+  O  (l)j:.",T.0^[h^(w.)-p'.p]^/n  =  0  {Ah. 
p     ^1=1  1    1     0      1       1  P     h 


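Then by the triangle inequality,

$\{\int\tau(w)\psi(X)^2[\hat h(w)-h_0(w)]^2dF_0(X)\}^{1/2} \le \{\int\tau(w)\psi(X)^2[p^K(w)'(\hat\beta-\beta)]^2dF_0(X)\}^{1/2}$
$+ \{\int\tau(w)\psi(X)^2[p^K(w)'\beta-h_0(w)]^2dF_0(X)\}^{1/2} \le \|\hat\beta-\beta\| + O(K^{-\alpha}) = O_p(\Delta_h),$

giving the first conclusion. Also, by the triangle inequality,

$|\hat h(w)-h_0(w)|_\delta \le |p^K(w)'\beta-h_0(w)|_\delta + |p^K(w)'(\hat\beta-\beta)|_\delta$

$\le O(K^{-\alpha}) + \zeta_\delta(K)\|\hat\beta-\beta\| = O_p(\zeta_\delta(K)\Delta_h).$     QED.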
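The proof above leans on two algebraic facts about the projection matrix $M$: it is idempotent, and $I-M$ annihilates anything in the column space of the regressor matrix. The following is a minimal sketch in exact rational arithmetic; the 3 x 2 matrix `P`, the vector `beta` and the helper names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def matmul(a, b):
    # Plain matrix product for lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical regressor matrix: n = 3 observations, K = 2 approximating terms.
P = [[Fraction(1), Fraction(0)],
     [Fraction(1), Fraction(1)],
     [Fraction(1), Fraction(2)]]

Pt = transpose(P)
M = matmul(matmul(P, inverse_2x2(matmul(Pt, P))), Pt)    # M = P (P'P)^{-1} P'
I3 = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
I_minus_M = [[I3[i][j] - M[i][j] for j in range(3)] for i in range(3)]

beta = [[Fraction(2)], [Fraction(-1)]]                   # any coefficient vector
residual = matmul(I_minus_M, matmul(P, beta))            # (I - M) P beta

is_idempotent = matmul(M, M) == M
annihilates_P = all(entry == 0 for row in residual for entry in row)
print(is_idempotent, annihilates_P)
```

Because $(I-M)P\beta = 0$, subtracting $P\beta$ from a vector before applying $M-I$ changes nothing, which is the step that lets the approximation error $h_0 - p^{K\prime}\beta$ enter the bound.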


To state Lemma A2, some additional notation is needed. Let $\hat\Sigma_1$, $\Sigma_1$, $\hat Q_1$, and $Q_1$ be as given in the body of the paper. Also, let $p_i = p^K(w_i)$, $r_i = r^L(z_i)$,

$\hat\Sigma = \sum_{i=1}^n\hat\tau_i\psi_i^2\hat p_i\hat p_i'[y_i-\hat h(\hat w_i)]^2/n$,     $\Sigma = E[\tau_i\psi_i^2p_ip_i'\mathrm{Var}(y_i|X_i)]$,

$\hat Q = \hat P'\hat P/n$,     $Q = E[\tau_i\psi_i^2p_ip_i']$,

$\hat H = \sum_{i=1}^n\hat\tau_i\psi_i\hat p_i\{[\partial\hat h(\hat w_i)/\partial w]'\partial w(X_i,\hat\Pi_i)/\partial\pi\otimes r_i'\}/n$,

$H = E[\tau_i\psi_ip_i\{[\partial h_0(w_i)/\partial w]'\partial w(X_i,\Pi_i)/\partial\pi\otimes r_i'\}]$,

$\hat V = A\hat Q^{-1}(\hat\Sigma + \hat H\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H')\hat Q^{-1}A'$,     $V = AQ^{-1}(\Sigma + HQ_1^{-1}\Sigma_1Q_1^{-1}H')Q^{-1}A'$.

For notational convenience, $K$ and $L$ subscripts for $V$ are suppressed.

Assumption A2:     i) $\|a(h)\| \le C|h|_\delta$;     ii) $h_0(w)$ and $w(X,\pi)$ are twice continuously differentiable in $w$ and $\pi$ respectively, and the first and second derivatives are bounded;     iii) either a) there are $\nu(w)$, $\beta_K$ such that $a(h_0) = E[\tau(w)\psi(X)^2\nu(w)h_0(w)]$, $a(p_{kK}) = E[\tau(w)\psi(X)^2\nu(w)p_{kK}(w)]$, and $E[\tau(w)\psi(X)^2\|\nu(w)-\beta_Kp^K(w)\|^2] \to 0$, or; b) $a(h)$ is a scalar and there exists $(\tilde\beta_K)$ such that for $\tilde h_K(w) = p^K(w)'\tilde\beta_K$, $E[\tilde h_K(w)^2] \to 0$, and $a(\tilde h_K)$ is bounded away from zero.

Under iii) a), let $d(X) = [\partial h_0(w)/\partial w]'\partial w(X,\Pi_0(z))/\partial\pi$, recall the definition of $\mathcal L$ as the set of limit points of $r^L(z)'\gamma_L$, and let $\rho(z)$ be the matrix of projections of elements of $\tau(w)\psi(X)\nu(w)d(X)$ on $\mathcal L$, and

$\bar V = E[\tau(w)\psi(X)^2\nu(w)\nu(w)'\mathrm{Var}(y|X)] + E[\rho(z)\mathrm{Var}(x|z)\rho(z)']$.

Lemma A2:     If Assumptions A1 and A2 are satisfied, $\mathrm{Var}(y|X)$ is bounded away from zero, $\zeta(L)$ and $\zeta_0(K)$ are bounded away from zero as $L$ and $K$ grow, and each of the following go to zero as $n \to \infty$: $\sqrt nK^{-\alpha}$, $\sqrt nL^{-\alpha_1}$, $\zeta_0(K)^2K^2/n$, $(K^2L+L^3)\zeta_1(K)^2/n$, $KL\zeta_0(K)^4\zeta(L)^2/n$, $L^2\zeta_0(K)^2\zeta(L)^4/n$, and $\zeta_0(K)^2\zeta_1(K)^2(LK+L^2)/n$, then $\hat\theta = \theta_0 + O_p(\zeta_\delta(K)/\sqrt n)$ and

$\sqrt nV^{-1/2}(\hat\theta-\theta_0) \xrightarrow{d} N(0,I)$,     $\sqrt n\hat V^{-1/2}(\hat\theta-\theta_0) \xrightarrow{d} N(0,I)$.

Furthermore, if Assumption A2, iii) a) is satisfied,

$\sqrt n(\hat\theta-\theta_0) \xrightarrow{d} N(0,\bar V)$,     $\hat V \xrightarrow{p} \bar V$.
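Proof of Lemma A2:     First, let

$\Delta_\pi = L^{1/2}/\sqrt n + L^{-\alpha_1}$,     $\Delta_h = K^{1/2}/\sqrt n + K^{-\alpha} + \Delta_\pi$,     $\Delta_{Q1} = \zeta(L)L^{1/2}/\sqrt n$,

$\Delta_Q = [K^{1/2}\zeta_1(K) + \zeta_0(K)^2\zeta(L)]\Delta_\pi + \zeta_0(K)K^{1/2}/\sqrt n$,

$\Delta_H = [L^{1/2}\zeta_1(K) + \zeta_0(K)\zeta(L)^2]\Delta_\pi + K^{1/2}\zeta(L)/\sqrt n$.

Note that by $\sqrt nK^{-\alpha} \to 0$ and $\sqrt nL^{-\alpha_1} \to 0$, $\Delta_\pi = (L^{1/2}/\sqrt n)(1 + L^{-1/2}\sqrt nL^{-\alpha_1}) = O(L^{1/2}/\sqrt n)$ and $\Delta_h = O(K^{1/2}/\sqrt n + L^{1/2}/\sqrt n)$. Therefore, it follows from the convergence rate conditions and $\zeta_0(K)$ and $\zeta(L)$ bounded away from zero that $\zeta(L)^2L^2/n \le CL^2\zeta_0(K)^2\zeta(L)^4/n \to 0$, and

(A.8)     $K^{1/2}\Delta_Q = O([K\zeta_1(K) + K^{1/2}\zeta_0(K)^2\zeta(L)]L^{1/2}/\sqrt n + \zeta_0(K)K/\sqrt n)$

$= O(\{K^2L\zeta_1(K)^2 + KL\zeta_0(K)^4\zeta(L)^2 + K^2\zeta_0(K)^2\}^{1/2}/\sqrt n) \to 0$,

$L^{1/2}\Delta_H = O([L\zeta_1(K) + L^{1/2}\zeta_0(K)\zeta(L)^2]L^{1/2}/\sqrt n + L^{1/2}K^{1/2}\zeta(L)/\sqrt n) \to 0$,

$\sqrt n\zeta_0(K)\Delta_\pi^2 = O(\zeta_0(K)L/\sqrt n) \to 0$,

$\zeta_0(K)L^{1/2}\zeta_1(K)\Delta_h = O(\{\zeta_0(K)^2L\zeta_1(K)^2(K+L)\}^{1/2}/\sqrt n) \to 0$.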

It also follows from these results that $L^{1/2}\Delta_{Q1} \to 0$, $\zeta_0(K)\Delta_h \to 0$ and $\zeta_1(K)\Delta_h \to 0$.

For simplicity the remainder of the proof will be given for the scalar $\Pi_0(z)$ case. Also, as in the proof of Lemma A1 it suffices to show the result with $p^K(w) = P^K(w)$, $r^L(z) = R^L(z)$, and $Q$ and $Q_1$ equal to identity matrices. In this case, $V = A[\Sigma + H\Sigma_1H']A'$. Let $F$ be a symmetric square root of $V^{-1}$. By $\sigma^2(X) = \mathrm{Var}(y|X)$ bounded away from zero and $Q = I$, $\Sigma - CI$ is positive semidefinite. Therefore,

(A.9)  $\|FA\| = \{\mathrm{tr}[FAA'F']\}^{1/2} \le \{\mathrm{tr}[CFA\Sigma A'F']\}^{1/2} \le \{\mathrm{tr}[CFVF']\}^{1/2} = C.$


Suppose that Assumption A2, iii), a) is satisfied. Then $A = E[\tau(w)\psi(X)^2\nu(w)p^K(w)']$. Let $\nu_K(w) = Ap^K(w)$. By $Q = I$, $E[\tau(w)\psi(X)^2\|\nu(w)-\nu_K(w)\|^2] \le E[\tau(w)\psi(X)^2\|\nu(w)-\beta_Kp^K(w)\|^2] \to 0$. Also let $d(X) = [\partial h_0(w)/\partial w]'\partial w(X,\Pi_0(z))/\partial\pi$, $b_{KL}(z) = E[\tau(w)\psi(X)d(X)\nu_K(w)r^L(z)']r^L(z)$, and $b_L(z) = E[\tau(w)\psi(X)d(X)\nu(w)r^L(z)']r^L(z)$. Since the mean square error (MSE) of a least squares projection is no greater than the MSE of the random variable being projected, $E[\|b_{KL}(z)-b_L(z)\|^2] \le E[\tau(w)\psi(X)^2d(X)^2\|\nu_K(w)-\nu(w)\|^2] \le CE[\tau(w)\psi(X)^2\|\nu_K(w)-\nu(w)\|^2] \to 0$ as $K \to \infty$. Furthermore, by Assumption A2, ii), b), $E[\|b_L(z)-\rho(z)\|^2] \to 0$ as $L \to \infty$. Then by boundedness of $\sigma^2(X) = \mathrm{Var}(y|X)$ and $\mathrm{Var}(x|z)$, and by the fact that MSE convergence implies convergence of second moments, it follows that

(A.10)  $V = E[\tau(w)\psi(X)^2\nu_K(w)\nu_K(w)'\sigma^2(X)] + E[b_{KL}(z)\mathrm{Var}(x|z)b_{KL}(z)'] \to \bar V.$

This shows that $F$ is bounded. Suppose that Assumption A2, iii) b) is satisfied. Then by the Cauchy-Schwarz inequality, $|a(\tilde h_K)| = |A\tilde\beta_K| \le \|A\|\|\tilde\beta_K\| = \|A\|(E[\tilde h_K(w)^2])^{1/2}$, so that $\|A\| \to \infty$. Also, $V \ge A\Sigma A' \ge C\|A\|^2$, so $F$ is also bounded under Assumption 5.

Next, by the proof of Lemma A1,

(A.11)  $\|\hat Q-I\| = O_p(\Delta_Q) = o_p(1)$,     $\|\hat Q_1-I\| = O_p(\Delta_{Q1}) = o_p(1).$

Further, for $\bar H = \sum_{i=1}^n\hat\tau_i\psi_i\hat p_id(X_i)r_i'/n$, similarly to Lemma A1,

(A.12)  $\|\bar H-H\| = O_p(\Delta_H) = o_p(1).$

Now, $\|\hat Q-I\| \xrightarrow{p} 0$ implies that the smallest eigenvalue of $\hat Q$ is bounded away from zero with probability approaching one, implying that the largest eigenvalue of $\hat Q^{-1}$ is $O_p(1)$, implying $\|B\hat Q^{-1}\| \le \|B\|O_p(1)$ for any matrix $B$. Therefore,

(A.13)  $\|FA(\hat Q^{-1}-I)\| \le \|FA(I-\hat Q)\hat Q^{-1}\| \le \|FA\|\|\hat Q-I\|O_p(1) \xrightarrow{p} 0.$

Also, $\|FA\hat Q^{-1/2}\|^2 = \mathrm{tr}(FA\hat Q^{-1}A'F') \le CO_p(1)\|FA\|^2 = O_p(1)$.

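Next, let $\beta$ be such that $|p^K(\cdot)'\beta - h_0(\cdot)|_\delta = O(K^{-\alpha})$. Then

(A.14)  $\|\sqrt nFa(p^{K\prime}\beta - h_0)\| \le C\sqrt n\|F\||p^K(\cdot)'\beta - h_0(\cdot)|_\delta = O(\sqrt nK^{-\alpha}) = o(1).$

Also, for $\tilde h_i = \hat\tau_i\psi_ih_0(\hat w_i)$ and $\tilde h = (\tilde h_1,\ldots,\tilde h_n)'$,

(A.15)  $\|FA\hat Q^{-1}\hat P'(\tilde h-\hat P\beta)/\sqrt n\| \le C\sqrt n\|FA\hat Q^{-1}\hat P'/\sqrt n\|\sup_{\mathcal W}|p^K(w)'\beta - h_0(w)|$

$\le C\sqrt n[\mathrm{tr}(FA\hat Q^{-1}A'F')]^{1/2}O(K^{-\alpha}) = O_p(\sqrt nK^{-\alpha}) = o_p(1).$

Then by $A\hat\beta = a(p^{K\prime}\hat\beta)$, $a(\hat h) = A\hat\beta$, eqs. (A.14) and (A.15), and the triangle inequality, for $h_i = \hat\tau_i\psi_ih_0(w_i)$ and $h = (h_1,\ldots,h_n)'$,

(A.16)  $\sqrt nF[a(\hat h)-a(h_0)] = FA\hat Q^{-1}\hat P'\eta/\sqrt n + FA\hat Q^{-1}\hat P'(h-\tilde h)/\sqrt n + o_p(1).$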

Let $\Pi = (\Pi_1,\ldots,\Pi_n)'$, $u_i = x_i-\Pi_i$, $U = (u_1,\ldots,u_n)'$, $\gamma$ be such that $\sup_{\mathcal Z}|\Pi_0(z)-r^L(z)'\gamma| = O(L^{-\alpha_1})$, and $d_i = d(X_i)$. By a second order mean-value expansion of each $h_0(\hat w_i)$ around $w_i$, and by eq. (B.O),

(A.17)  $FA\hat Q^{-1}\hat P'(h-\tilde h)/\sqrt n = FA\hat Q^{-1}\sum_{i=1}^n\hat\tau_i\psi_i\hat p_id_i[\hat\Pi_i-\Pi_i]/\sqrt n + \hat\rho = FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'U/\sqrt n$

$+ FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'(\Pi-R\gamma)/\sqrt n + FA\hat Q^{-1}\sum_{i=1}^n\hat\tau_i\psi_i\hat p_id_i[r_i'\gamma-\Pi_i]/\sqrt n + \hat\rho,$

$\|\hat\rho\| \le C\sqrt n\|FA\hat Q^{-1/2}\|\zeta_0(K)\sum_{i=1}^n\|\hat\Pi_i-\Pi_i\|^2/n = O_p(\sqrt n\zeta_0(K)\Delta_\pi^2) = o_p(1).$

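Also, by $d_i$ bounded and $n\bar H\hat Q_1^{-1}\bar H'$ equal to the matrix sum of squares from the multivariate regression of $\hat\tau_i\psi_i\hat p_id_i$ on $r_i$, $\bar H\hat Q_1^{-1}\bar H' \le \sum_{i=1}^n\hat\tau_i\psi_i^2\hat p_i\hat p_i'd_i^2/n \le C\hat Q$. Therefore,

(A.18)  $\|FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'(\Pi-R\gamma)/\sqrt n\| \le \|FA\hat Q^{-1}\bar H\hat Q_1^{-1}R'/\sqrt n\|\sqrt n\sup_{\mathcal Z}|\Pi_0(z)-r^L(z)'\gamma|$

$\le [\mathrm{tr}(FA\hat Q^{-1}\bar H\hat Q_1^{-1}\hat Q_1\hat Q_1^{-1}\bar H'\hat Q^{-1}A'F')]^{1/2}O(\sqrt nL^{-\alpha_1}) \le C\|FA\hat Q^{-1/2}\|O(\sqrt nL^{-\alpha_1})$

$= O_p(\sqrt nL^{-\alpha_1}) = o_p(1).$

Similarly,

(A.19)  $\|FA\hat Q^{-1}\sum_{i=1}^n\hat\tau_i\psi_i\hat p_id_i[r_i'\gamma-\Pi_i]/\sqrt n\| \le C\|FA\hat Q^{-1/2}\|O(\sqrt nL^{-\alpha_1}) = o_p(1).$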

Next, note that $E[\|R'U/\sqrt n\|^2] = \mathrm{tr}(\Sigma_1) \le C\mathrm{tr}(I_L) = CL$ by $E[u^2|z]$ bounded, so by the Markov inequality,

(A.20)  $\|R'U/\sqrt n\| = O_p(L^{1/2}).$

Also, note that $\|FA\hat Q^{-1}\bar H\hat Q_1^{-1}\| \le O_p(1)\|FA\hat Q^{-1/2}\| = O_p(1)$. Therefore,

(A.21)  $\|FA\hat Q^{-1}\bar H(\hat Q_1^{-1}-I)R'U/\sqrt n\| \le \|FA\hat Q^{-1}\bar H\hat Q_1^{-1}\|\|\hat Q_1-I\|\|R'U/\sqrt n\| = O_p(L^{1/2}\Delta_{Q1}) = o_p(1).$

By similar reasoning, and by eq. (A.8),

(A.22)  $\|FA\hat Q^{-1}(\bar H-H)R'U/\sqrt n\| \le \|FA\hat Q^{-1}\|\|\bar H-H\|\|R'U/\sqrt n\| = O_p(\Delta_HL^{1/2}) = o_p(1).$


Noting also that $HH'$ is the population matrix mean-square of the regression of $\tau_i\psi_ip_id_i$ on $r_i$, so that $HH' \le CI$, it follows that $E[\|HR'U/\sqrt n\|^2] = \mathrm{tr}(H\Sigma_1H') \le CK$. Therefore, $\|HR'U/\sqrt n\| = O_p(K^{1/2})$, and

(A.23)  $\|FA(\hat Q^{-1}-I)HR'U/\sqrt n\| \le \|FA\hat Q^{-1}\|\|I-\hat Q\|\|HR'U/\sqrt n\| = O_p(\Delta_QK^{1/2}) = o_p(1).$

Combining eqs. (A.16)-(A.23) and the triangle inequality gives

(A.24)  $FA\hat Q^{-1}\hat P'(h-\tilde h)/\sqrt n = FAHR'U/\sqrt n + o_p(1).$

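Next, it follows as in the proof of Lemma A1 that $\|\hat Q^{-1/2}(\hat P-P)'\eta/\sqrt n\| = O_p(\zeta_0(K)^2\zeta(L)\Delta_\pi + \zeta_1(K)\Delta_\pi) = o_p(1)$, implying

(A.25)  $\|FA\hat Q^{-1}(\hat P-P)'\eta/\sqrt n\| \le \|FA\hat Q^{-1/2}\|\|\hat Q^{-1/2}(\hat P-P)'\eta/\sqrt n\| = o_p(1).$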

Also, by $E[\eta|X] = 0$,

(A.26)  $E[\|FA(\hat Q^{-1}-I)P'\eta/\sqrt n\|^2|X] = \mathrm{tr}(FA(I-\hat Q)\hat Q^{-1}\Sigma\hat Q^{-1}(I-\hat Q)A'F')$

$\le C\|FA(I-\hat Q)\hat Q^{-1}\|^2 \le \|FA\|^2\|I-\hat Q\|^2O_p(1),$

so that $\|FA(\hat Q^{-1}-I)P'\eta/\sqrt n\| \xrightarrow{p} 0$. Then combining eqs. (A.25)-(A.26) with eq. (A.16) and the triangle inequality gives

(A.27)  $\sqrt nF[a(\hat h)-a(h_0)] = FA(P'\eta/\sqrt n + HR'U/\sqrt n) + o_p(1).$

Next, for any vector $\phi$ with $\|\phi\| = 1$ let $Z_{in} = \phi'FA[\tau_i\psi_ip_i\eta_i + Hr_iu_i]/\sqrt n$. Note that $Z_{in}$ are i.i.d. across $i$ for a given $n$, $E[Z_{in}] = 0$, and $\mathrm{Var}(Z_{in}) = 1/n$. Furthermore, for any $\varepsilon > 0$, $\|FA\| \le C$ and $\|FAH\| \le C\|FA\| \le C$ by $CI-HH'$ positive semidefinite, so that

$nE[1(|Z_{in}| > \varepsilon)Z_{in}^2] = n\varepsilon^2E[1(|Z_{in}| > \varepsilon)(Z_{in}/\varepsilon)^2] \le n\varepsilon^{-2}E[|Z_{in}|^4]$

$\le Cn\varepsilon^{-2}\|\phi\|^4\{E[\|\tau_i\psi_ip_i\|^4E[\eta_i^4|X_i]] + E[\|r_i\|^4E[u_i^4|z_i]]\}/n^2$

$\le Cn^{-1}\{\zeta_0(K)^2E[\|\tau_i\psi_ip_i\|^2] + \zeta(L)^2E[\|r_i\|^2]\} = O([\zeta_0(K)^2K + \zeta(L)^2L]/n) = o(1).$

Therefore, $\sqrt nF(\hat\theta-\theta_0) \xrightarrow{d} N(0,I)$ follows by the Lindeberg-Feller theorem and equation (A.27).

As previously noted, $CI - H\Sigma_1H'$ is positive semi-definite, so that $V \le C\|A\|^2$. Suppose $a(h)$ is a scalar. It follows from the above proof that $A \ne 0$ for all $n$ large enough. Note that for any $\beta$, the Cauchy-Schwarz inequality implies $|p^{K\prime}\beta|_\delta \le \zeta_\delta(K)\|\beta\|$, so that $\|A\|^2 = |a(Ap^K)| \le C|Ap^K|_\delta \le C\zeta_\delta(K)\|A\|$. Dividing by $\|A\|$ then gives $\|A\|^2 \le C\zeta_\delta(K)^2$. Therefore, $\hat\theta-\theta_0 = O_p(V^{1/2}/\sqrt n) = O_p(\zeta_\delta(K)/\sqrt n)$. This result for the scalar $a(h)$ covers the case of Assumption A2, iii), b). In the other case it follows from $V \to \bar V$ that $\hat\theta-\theta_0 = O_p(1/\sqrt n) = O_p(\zeta_\delta(K)/\sqrt n)$.

Next, by Lemma A1, $\max_{i\le n}|\hat h_i-\tilde h_i| = O_p(\zeta_0(K)\Delta_h) = o_p(1)$. Also, it follows by Theorem 1 of Newey (1997) that $\max_{i\le n}|\hat\Pi_i-\Pi_i| = O_p(\zeta(L)\Delta_\pi) = o_p(1)$, so that by $h_0(w)$ and $w(X,\pi)$ Lipschitz in $w$ and $\pi$ respectively, $\max_{i\le n}|\tilde h_i-h_i| = o_p(1)$. Hence, by the triangle inequality, $\max_{i\le n}|\hat h_i-h_i| = o_p(1)$. Note that by $\hat\tau_i^2 = \hat\tau_i$, $\hat\tau_i(\hat\eta_i^2-\eta_i^2) = \hat\tau_i[-2\eta_i(\hat h_i-h_i)+(\hat h_i-h_i)^2] = \hat v_i$. Also, for $D = FA\hat Q^{-1}\hat P'\mathrm{diag}\{1+|\eta_1|,\ldots,1+|\eta_n|\}\hat P\hat Q^{-1}A'F'/n$, $E[D|X] \le CFA\hat Q^{-1}A'F' = O_p(1)$. Let $\tilde\Sigma = \sum_{i=1}^n\hat\tau_i\hat p_i\hat p_i'\eta_i^2/n$. Then

$\|FA\hat Q^{-1}(\hat\Sigma-\tilde\Sigma)\hat Q^{-1}A'F'\| = \|FA\hat Q^{-1}\hat P'\mathrm{diag}\{\hat v_1,\ldots,\hat v_n\}\hat P\hat Q^{-1}A'F'/n\|$

$\le C\mathrm{tr}(D)\max_{i\le n}|\hat h_i-h_i| = O_p(1)o_p(1) = o_p(1).$

Also, by $E[\eta^2|X]$ uniformly bounded, $E[\sum_{i=1}^n\eta_i^2|\hat\tau_i-\tau_i|/n\,|X] \le C\sum_{i=1}^n|\hat\tau_i-\tau_i|/n$ and $E[\sum_{i=1}^n\eta_i^2\|\hat w_i-w_i\|^2/n\,|X] \le C\sum_{i=1}^n\|\hat w_i-w_i\|^2/n$, so that $\sum_{i=1}^n\eta_i^2|\hat\tau_i-\tau_i|/n = O_p(\zeta(L)\Delta_\pi)$ and $\sum_{i=1}^n\eta_i^2\|\hat w_i-w_i\|^2/n = O_p(\Delta_\pi^2)$. Then similarly to the proof of $\|\hat Q-I\| = O_p(\Delta_Q)$,

$\|\tilde\Sigma-\Sigma\| \le \|\sum_{i=1}^n(\hat\tau_i\hat p_i\hat p_i'-\tau_ip_ip_i')\eta_i^2/n\| + \|\sum_{i=1}^n\tau_ip_ip_i'\eta_i^2/n - \Sigma\|$

$= O_p(\Delta_Q) + O_p(\{E[\tau_i\|p_i\|^4E[\eta_i^4|X_i]]/n\}^{1/2}) = O_p(\Delta_Q + \zeta_0(K)^2K/n) = o_p(1).$

Therefore, $\|FA\hat Q^{-1}(\tilde\Sigma-\Sigma)\hat Q^{-1}A'F'\| \le \|FA\hat Q^{-1}\|^2\|\tilde\Sigma-\Sigma\| \xrightarrow{p} 0$. Furthermore, $\|B\Sigma\| \le C\|B\|$ for any matrix $B$, by $CI-\Sigma$ positive semi-definite, so that

$\|FA(\hat Q^{-1}\Sigma\hat Q^{-1}-\Sigma)A'F'\| \le \|FA(\hat Q^{-1}-I)\Sigma\hat Q^{-1}A'F'\| + \|FA\Sigma(\hat Q^{-1}-I)A'F'\| \le \|FA\hat Q^{-1}\|^2\|(I-\hat Q)\Sigma\| + \|FA\Sigma\|\|I-\hat Q\|\|FA\hat Q^{-1}\| \le O_p(1)\|I-\hat Q\| + \|FA\|o_p(1)O_p(1) = o_p(1).$

Then by the triangle inequality, it suffices to show that $FA(\hat Q^{-1}\hat H\hat Q_1^{-1}\hat\Sigma_1\hat Q_1^{-1}\hat H'\hat Q^{-1} - H\Sigma_1H')A'F' \xrightarrow{p} 0$.

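To show this last result, note first that it follows similarly to the argument for $\|\tilde\Sigma-\Sigma\| \to 0$ that $\|\hat\Sigma_1-\Sigma_1\| \xrightarrow{p} 0$. Let $\hat d_i = [\partial\hat h(\hat w_i)/\partial w]'\partial w(X_i,\hat\Pi_i)/\partial\pi$ and $d_i = d(X_i)$. Then by the conclusion of Lemma A1, $\partial w(X,\pi)/\partial\pi$ bounded,

$\sum_{i=1}^n\hat\tau_i|\hat d_i-d_i|^2/n \le C(\sup_{\mathcal W}|\hat h-h_0|_1^2 + \sum_{i=1}^n\|\hat\Pi_i-\Pi_i\|^2/n) = O_p(\zeta_1(K)^2\Delta_h^2).$

Therefore,

$\|\hat H-\bar H\| \le C(\sum_{i=1}^n\hat\tau_i\|\hat p_i\|^2\|r_i\|^2/n)^{1/2}(\sum_{i=1}^n\hat\tau_i|\hat d_i-d_i|^2/n)^{1/2} = O_p(\zeta_0(K)L^{1/2}\zeta_1(K)\Delta_h) = o_p(1).$

Hence, by eq. (A.12) and the triangle inequality, $\|\hat H-H\| \xrightarrow{p} 0$. The conclusion then follows similarly to previous arguments. For example, by logic like that above,

$\|FA\hat Q^{-1}\hat H\hat Q_1^{-1}(\hat\Sigma_1-\Sigma_1)\hat Q_1^{-1}\hat H'\hat Q^{-1}A'F'\| \le \|FA\hat Q^{-1}\hat H\hat Q_1^{-1}\|^2\|\hat\Sigma_1-\Sigma_1\|$

$\le \mathrm{tr}(FA\hat Q^{-1}\hat H\hat Q_1^{-2}\hat H'\hat Q^{-1}A'F')o_p(1) = o_p(1).$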

It then follows by similar arguments and the triangle inequality that $F\hat VF' \xrightarrow{p} I$. In the case of Assumption A2, iii), b), where $a(h)$ is a scalar, it follows by taking a square root that $\hat V^{-1/2}/V^{-1/2} \xrightarrow{p} 1$, so that $\sqrt n\hat V^{-1/2}(\hat\theta-\theta_0) = (\hat V^{-1/2}/V^{-1/2})\sqrt nV^{-1/2}(\hat\theta-\theta_0) \xrightarrow{d} N(0,I)$. In the other case the last conclusion follows similarly from $F \to \bar V^{-1/2}$.     QED.


Lemma A3:     If $v_i$ is i.i.d. and continuously distributed with bounded density and $\max_{i\le n}|\hat v_i-v_i| = O_p(\delta_n)$, $\delta_n \to 0$, then $\sum_{i=1}^n|1(a\le\hat v_i\le b)-1(a\le v_i\le b)|/n = O_p(\delta_n)$.

Proof:     By the density of $v_i$ bounded, for any $\Delta_n > 0$, $\Delta_n \to 0$, by the Markov inequality,

$\sum_{i=1}^n[1(|v_i-a|\le\Delta_n)+1(|v_i-b|\le\Delta_n)]/n = O_p(\mathrm{Prob}(|v_i-a|\le\Delta_n)+\mathrm{Prob}(|v_i-b|\le\Delta_n)) = O_p(\Delta_n).$

We also use the well known result that $Y_n = O_p(1)$ if and only if $\varepsilon_nY_n \xrightarrow{p} 0$ for all positive sequences with $\varepsilon_n \to 0$ slowly enough. Consider any positive sequence $\varepsilon_n$ such that $\varepsilon_n$ goes to zero slower than $\delta_n^2$, i.e. $(\varepsilon_n)^{-1/2}\delta_n \to 0$. Then $\nu_n = (\varepsilon_n)^{1/2} \to 0$, so $\nu_n\max_{i\le n}|\hat v_i-v_i|/\delta_n \xrightarrow{p} 0$. It follows that $\max_{i\le n}|\hat v_i-v_i| \le \delta_n/\nu_n$ with probability approaching one (w.p.a.1) as $n \to \infty$. Then w.p.a.1,

$\varepsilon_n\sum_{i=1}^n|1(a\le\hat v_i\le b)-1(a\le v_i\le b)|/(\delta_nn) = \varepsilon_n\sum_{i=1}^n|1(a\le v_i+[\hat v_i-v_i]\le b)-1(a\le v_i\le b)|/(\delta_nn)$

$\le \varepsilon_n\sum_{i=1}^n[1(|v_i-a|\le\delta_n/\nu_n)+1(|v_i-b|\le\delta_n/\nu_n)]/(\delta_nn) = \varepsilon_n\delta_n^{-1}O_p(\delta_n/\nu_n) = O_p(\nu_n) = o_p(1).$     QED.
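The mechanism behind Lemma A3 can be checked numerically: an indicator $1(a \le v \le b)$ can only change when $v$ lies within the estimation error of an endpoint, and with a bounded density the share of such observations is proportional to that error. The sketch below is illustrative only; the uniform design, the endpoints, the error sizes and the function name are assumptions, not part of the paper.

```python
import random

def indicator_change_fraction(v, v_hat, a, b):
    """Share of observations whose indicator 1(a <= . <= b) differs between v and v_hat."""
    flips = sum((a <= x <= b) != (a <= y <= b) for x, y in zip(v, v_hat))
    return flips / len(v)

random.seed(0)
n, a, b = 20000, 0.25, 0.75
v = [random.random() for _ in range(n)]          # uniform on [0, 1], density bounded by 1

results = {}
for delta in (0.05, 0.01, 0.002):
    # Hypothetical estimation error with max_i |v_hat_i - v_i| <= delta.
    v_hat = [x + random.uniform(-delta, delta) for x in v]
    # Share of observations within delta of either endpoint.
    near_boundary = sum(abs(x - a) <= delta or abs(x - b) <= delta for x in v) / n
    results[delta] = (indicator_change_fraction(v, v_hat, a, b), near_boundary)
    print(delta, results[delta])
```

In every run the share of changed indicators is bounded by the share of observations near an endpoint, and the latter is close to $4\delta$ in this uniform design, so both shrink at the rate of the estimation error.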

Proof  of  Lemma  4.1:     Because  series  estimators  are  invariant  to  nonsingular  linear 
transformations  of  the  approximating  functions,   a  location  and  scale  shift  allows  us  to 
assume  that     W  =  [0,1]         and     Z  =  [0,1]  i.     Also,   in  the  polynomial  case,  the  component 
powers  can  be  replaced  by  polynomials  of  the  same  order  that  are  orthonormal  with  respect 
to  the  uniform  distribution  on     [0,1],     and  in  the  spline  case  by  a  B-spline  basis.      In 
both  cases  the  resulting  vector  of  functions  is  a  nonsingular  linear  combination  of  the 
original  vector,  because  of  the  assumption  that  the  order  of  the  multi-index  is 
increasing.     Also,  assume  that  the  same  operation  is  carried  out  on  the  first  step 
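approximating functions $r^L(z)$. For any nonnegative integer $j$,

$\zeta_j(K) = \max_{|\mu|\le j}\sup_{w\in\mathcal W}\|\partial^\mu p^K(w)\|$,     $\zeta_j(L) = \max_{|\mu|\le j}\sup_{z\in\mathcal Z}\|\partial^\mu r^L(z)\|$.

Then it follows from Newey (1997) that in the polynomial case,
$\zeta_j(K) \le CK^{1+2j}$,     $\zeta(L) \le CL$,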
and in the spline case that
It follows from these inequalities and Assumption 4 that


[Cq(K)C^(K)  +  Co(K)^?q(L)]A^  ^0.     A^  =  (L/n)^^^  +  L  V^^i, 


so  the  conclusion  follows  by  Lemma  Al.      QED. 
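The polynomial-case bound $\zeta_0(K) \le CK$ used above can be illustrated in one dimension. For Legendre polynomials made orthonormal with respect to the uniform distribution on $[0,1]$, the norm of the vector of the first $K$ functions peaks at the endpoints, where it equals $K$. The sketch below approximates the supremum on a grid; the function names and the grid size are assumptions made for illustration.

```python
import math

def orthonormal_legendre(K, w):
    """First K Legendre polynomials, orthonormal for the uniform distribution on [0, 1]."""
    x = 2.0 * w - 1.0                       # map [0, 1] to [-1, 1]
    values = [1.0, x]                       # P_0 and P_1 on [-1, 1]
    for k in range(1, K - 1):               # Bonnet recurrence for P_{k+1}
        values.append(((2 * k + 1) * x * values[k] - k * values[k - 1]) / (k + 1))
    # Rescale so that each function has unit second moment under the uniform law.
    return [math.sqrt(2 * k + 1) * values[k] for k in range(K)]

def zeta0(K, grid_size=2001):
    """Grid approximation to sup over [0, 1] of the Euclidean norm of p^K(w)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(math.sqrt(sum(p * p for p in orthonormal_legendre(K, w))) for w in grid)

for K in (1, 2, 5, 10):
    print(K, zeta0(K))
```

The grid includes both endpoints, so the computed value matches $K$ up to rounding; this is the $j = 0$ case of the rate $K^{1+2j}$.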


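Proof of Theorem 4.2:     Without changing the notation let $F_0(w)$ denote the conditional distribution of $w$ given $\tau(w) = 1$, let $\bar\lambda(u) = \hat\lambda(u) - \int\hat\lambda(u)dF_0(u)$, and $\bar\lambda_0(u) = \lambda_0(u) - \int\lambda_0(u)dF_0(u)$. For $w_1 = (x,z_1)$, note also that $\bar g(w_1) = \hat g(w_1)-\int\hat g(w_1)dF_0(w_1)$ and $\bar g_0(w_1) = g_0(w_1)-\int g_0(w_1)dF_0(w_1)$. Also by the density of $w$ being bounded and bounded away from zero, so is the conditional density on $\mathcal W$, and the corresponding marginals. Therefore, for $\Delta = \int\{\hat h(w)-h_0(w)\}dF_0(w)$,

$\int[\hat h(w)-h_0(w)]^2dF_0(w) \ge C\int[\hat g(w_1)+\hat\lambda(u)-g_0(w_1)-\lambda_0(u)]^2dF_0(w_1)dF_0(u)$

$= C\int[\bar g(w_1)-\bar g_0(w_1) + \bar\lambda(u)-\bar\lambda_0(u) + \Delta]^2dF_0(w_1)dF_0(u)$

$= C\int[\bar g(w_1)-\bar g_0(w_1)]^2dF_0(w_1) + C\int[\bar\lambda(u)-\bar\lambda_0(u)]^2dF_0(u) + C\Delta^2$

$\ge C\max\{\int[\bar g(w_1)-\bar g_0(w_1)]^2dF_0(w_1), \int[\bar\lambda(u)-\bar\lambda_0(u)]^2dF_0(u), \Delta^2\},$

where the product terms drop out by the construction of $\bar g(w_1)$, $\bar\lambda(u)$, etc., e.g. $\int[\bar g(w_1)-\bar g_0(w_1)][\bar\lambda(u)-\bar\lambda_0(u)]dF_0(w_1)dF_0(u) = \int[\bar g(w_1)-\bar g_0(w_1)]dF_0(w_1)\cdot\int[\bar\lambda(u)-\bar\lambda_0(u)]dF_0(u) = 0$.     QED.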
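The step where the product terms drop out can be verified on a small discrete example: once each additive component is centered under its own marginal, the second moment of the sum under the product measure splits into the two component second moments plus the squared constant. The marginals and error functions below are hypothetical and serve only to illustrate the identity.

```python
def mean(values, weights):
    # Expectation of a discrete variable with the given probabilities.
    return sum(v * p for v, p in zip(values, weights))

# Hypothetical discrete marginals for w1 and u (each set of probabilities sums to one).
w1_points, w1_probs = [0.0, 0.5, 1.0], [0.2, 0.5, 0.3]
u_points, u_probs = [-1.0, 0.0, 2.0], [0.3, 0.3, 0.4]

# Hypothetical estimation errors for the two additive components.
g_err = [w ** 2 + 0.1 for w in w1_points]
lam_err = [0.5 * u - 0.2 for u in u_points]

g_bar = [g - mean(g_err, w1_probs) for g in g_err]          # centered under the w1 marginal
lam_bar = [l - mean(lam_err, u_probs) for l in lam_err]     # centered under the u marginal
shift = mean(g_err, w1_probs) + mean(lam_err, u_probs)      # the constant term

# Second moment of the total error under the product of the two marginals.
total = sum(pw * pu * (g + l) ** 2
            for g, pw in zip(g_err, w1_probs) for l, pu in zip(lam_err, u_probs))
parts = (mean([g * g for g in g_bar], w1_probs)
         + mean([l * l for l in lam_bar], u_probs) + shift ** 2)
print(total, parts)
```

The two printed numbers agree up to rounding, so each of the three nonnegative pieces is bounded by the total, which is what the final inequality in the proof uses.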

Proof  of  Theorem  4.3:     Follows  from  Lemma  4.1  by  fixing  the  value  of     u     at  any  point  in 
W.     QED. 

Proof of Theorem 5.1:     For power series, using the inequalities in the proof of Theorem 4.1, it follows that

$\zeta_0(K)^2K^2 \le CK^4$,     $(K^2L+L^3)\zeta_1(K)^2 \le C(K^2L+L^3)K^6$,     $KL\zeta_0(K)^4\zeta(L)^2 \le CK^5L^3$,

$L^2\zeta_0(K)^2\zeta(L)^4 \le CK^2L^6$,     $\zeta_0(K)^2\zeta_1(K)^2(LK+L^2) \le CK^8(KL+L^2)$,


and  for  splines  that 


It  then  follows  from  the  rate  conditions  of  Theorem  4.2  that  the  rate  conditions  of  Lemma 
A2  are  satisfied.     The  conclusion  then  follows  from  Lemma  A2,   similarly  to  the  proof  of 
Theorem 4.1.      QED.
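The power-series inequalities in the proof of Theorem 5.1 are exponent bookkeeping. The sketch below assumes the polynomial-case bounds $\zeta_0(K) \le CK$, $\zeta_1(K) \le CK^3$ and $\zeta(L) \le CL$, sets every constant to one, and reads the five rate expressions as reconstructed here from the garbled display; both the reading and the function name are assumptions.

```python
def power_series_bounds(K, L):
    """Pairs (left side, right side) of the power-series inequalities with all constants set to 1."""
    zeta0, zeta1, zeta_l = K, K ** 3, L     # zeta_j(K) = K^(1 + 2j) and zeta(L) = L
    return [
        (zeta0 ** 2 * K ** 2, K ** 4),
        ((K ** 2 * L + L ** 3) * zeta1 ** 2, (K ** 2 * L + L ** 3) * K ** 6),
        (K * L * zeta0 ** 4 * zeta_l ** 2, K ** 5 * L ** 3),
        (L ** 2 * zeta0 ** 2 * zeta_l ** 4, K ** 2 * L ** 6),
        (zeta0 ** 2 * zeta1 ** 2 * (L * K + L ** 2), K ** 8 * (K * L + L ** 2)),
    ]

for left, right in power_series_bounds(K=4, L=3):
    print(left, right)
```

With the bounds holding as equalities, the two sides coincide, so each rate condition of Lemma A2 reduces to a polynomial in $K$ and $L$ divided by $n$.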
Table 1: Estimates

Variable   OLS                2SLS                 NP                     NPIV
h          .000017 (.5597)    .000387 (2.7447)     -.0094 (2.5513)        -.01097 (2.9170)
h^2                                                7.0151e-6 (2.6363)     7.4577e-6 (2.7986)
h^3                                                -2.1707e-9 (2.6304)    -2.0250e-9 (2.4640)
h^4                                                2.3692e-13 (2.5537)    1.8929e-13 (2.0311)
r                             -.000384 (2.6732)                           -.0005 (2.7643)
r^2                                                                       -6.6464e-8 (.5211)
r^3                                                                       3.4978e-10 (2.7516)

Notes: i) h denotes hours worked and r is the reduced form residual.
ii) The regressors in the wage equation included education dummies, union status, tenure, full-time work experience, black dummy and regional variables.
iii) The variables included in Z which are excluded from X are marital status, health status, presence of young children, rural dummy and non-labor income.
iv) Absolute values of t-statistics are reported in parentheses.


Table 2: Cross Validation Values

Specification     Cross Validation Value
h=0; r=0          185.567
h=1; r=1          183.158
h=2; r=2          183.738
h=3; r=3          182.899
h=4; r=4          182.882
h=5; r=5          183.128
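Reading Table 2 amounts to picking the specification with the smallest cross validation value. A minimal sketch of that selection, using the values exactly as printed in the table, follows; the dictionary and variable names are illustrative.

```python
# Cross validation values from Table 2, keyed by the specification label as printed.
cv_values = {
    "h=0; r=0": 185.567,
    "h=1; r=1": 183.158,
    "h=2; r=2": 183.738,
    "h=3; r=3": 182.899,
    "h=4; r=4": 182.882,
    "h=5; r=5": 183.128,
}

# The preferred specification minimizes the cross validation criterion.
best_spec = min(cv_values, key=cv_values.get)
print(best_spec, cv_values[best_spec])
```

The minimum is attained at h=4; r=4, although the value at h=3; r=3 differs from it by less than 0.02.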
Figure 1: Estimate of Wage/Hours Profile from Polynomial Approximation

Figure 2: Estimate of Wage/Hours Profile from Spline Approximation

Figure 3: 95% Confidence Interval for Wage/Hours Profile from Polynomial

95% Confidence Interval for Wage/Hours Profile from Spline